<a href="https://github.com/dd-consulting">
<img src="../reference/GZ_logo.png" width="60" align="right">
</a>
<h1>
One-Stop Analytics: Predictive Modeling (Regression Models)
</h1>
Autism spectrum disorder (ASD) is a developmental disability that can cause significant social, communication and behavioral challenges. CDC is committed to continuing to provide essential data on ASD, search for factors that put children at risk for ASD and possible causes, and develop resources that help identify children with ASD as early as possible.
Doctors cited better awareness among parents and preschool teachers, leading to early referrals for diagnosis.
https://www.gov.sg/news/content/today-online-more-preschoolers-diagnosed-with-developmental-issues
https://www.cdc.gov/ncbddd/autism/data/index.html
if(!require(repr)){install.packages("repr")}
## Loading required package: repr
library("repr") # Show graphs in-line in the notebook
Obtain current R working directory
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
Set new R working directory
# setwd("/media/sf_vm_shared_folder/git/DDC/DDC-ASD/model_R")
# setwd('~/Desktop/admin-desktop/vm_shared_folder/git/DDC-ASD/model_R')
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
<h3>
Linear Model: Simple Linear Regression (SLR) - Workshop Task
</h3>
Workshop Task:
Use Case Data: “../dataset/ADV_ASD_State_R.csv”
Read in CSV data, storing as R dataframe
# Read back in above saved file:
ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# Convert Year_Factor to ordered.factor
ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE)
ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE,
levels=c("Low", "High"))
ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE,
levels=c("Low", "Medium", "High", "Very High"))
head(ASD_State)
## State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1 AZ 45322 6.5 5.8 7.3 2000 addm
## 2 GA 43593 6.5 5.8 7.3 2000 addm
## 3 MD 21532 5.5 4.6 6.6 2000 addm
## 4 NJ 29714 9.9 8.9 11.1 2000 addm
## 5 SC 24535 6.3 5.4 7.4 2000 addm
## 6 WV 23065 4.5 3.7 5.5 2000 addm
## Source_Full1 State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network Arizona
## 2 Autism & Developmental Disabilities Monitoring Network Georgia
## 3 Autism & Developmental Disabilities Monitoring Network Maryland
## 4 Autism & Developmental Disabilities Monitoring Network New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network West Virginia
## State_Full2 Numerator_ASD Numerator_NonASD Proportion
## 1 AZ-Arizona 295 45027 0.006508980
## 2 GA-Georgia 283 43310 0.006491868
## 3 MD-Maryland 118 21414 0.005480215
## 4 NJ-New Jersey 294 29420 0.009894326
## 5 SC-South Carolina 155 24380 0.006317506
## 6 WV-West Virginia 104 22961 0.004508996
## Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1 5.798905 7.303948 9.7
## 2 5.769431 7.302595 11.0
## 3 4.557351 6.583638 8.6
## 4 8.814705 11.102544 14.8
## 5 5.381662 7.411085 9.3
## 6 3.703408 5.483723 6.6
## Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1 8.5 11.1 3.2 2.5 4.0
## 2 9.7 12.4 2.0 1.5 2.7
## 3 7.1 10.6 2.2 1.5 2.7
## 4 13.0 16.8 4.3 3.3 5.5
## 5 7.8 11.2 3.3 2.4 4.5
## 6 5.2 8.2 2.4 1.6 3.5
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1 8.6 7.5
## 2 7.9 6.7
## 3 4.9 3.8
## 4 11.3 9.5
## 5 6.5 5.2
## 6 4.5 3.7
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1 9.8 7.3
## 2 9.3 5.3
## 3 6.4 6.1
## 4 13.3 10.6
## 5 8.2 5.8
## 6 5.5 NA
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1 4.4 12.2 NA
## 2 4.4 6.4 NA
## 3 4.7 8.0 NA
## 4 8.5 13.1 NA
## 5 4.5 7.3 NA
## 6 NA NA NA
## Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## State_Region Source_UC
## 1 D8 Mountain ADDM
## 2 D5 South Atlantic ADDM
## 3 D5 South Atlantic ADDM
## 4 D2 Middle Atlantic ADDM
## 5 D5 South Atlantic ADDM
## 6 D5 South Atlantic ADDM
## Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network Low
## Prevalence_Risk4 Year_Factor
## 1 Medium 2000
## 2 Medium 2000
## 3 Medium 2000
## 4 Medium 2000
## 5 Medium 2000
## 6 Low 2000
# Filter [ Source: ADDM ], keeping only two columns for SLR:
# Dependent variable: Prevalence
# Independent variable: Year
ASD_State_4_SLR = subset(ASD_State, Source_UC == 'ADDM', select = c(Prevalence, Year))
#
dim(ASD_State_4_SLR)
## [1] 86 2
head(ASD_State_4_SLR)
## Prevalence Year
## 1 6.5 2000
## 2 6.5 2000
## 3 5.5 2000
## 4 9.9 2000
## 5 6.3 2000
## 6 4.5 2000
SLR Workshop Task: 1. a. Graph the data in a scatterplot to determine if there is a possible linear relationship.
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
plot(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence)
SLR Workshop Task: 2. b. Compute and interpret the linear correlation coefficient, r.
Compute the correlation coefficient
cor(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence)
## [1] 0.7224098
Apply correlation test (two tail: != 0)
cor.test(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence)
##
## Pearson's product-moment correlation
##
## data: ASD_State_4_SLR$Year and ASD_State_4_SLR$Prevalence
## t = 9.5753, df = 84, p-value = 4.13e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6027995 0.8102653
## sample estimates:
## cor
## 0.7224098
Apply correlation test (one tail: > 0)
cor.test(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence, alternative = "greater")
##
## Pearson's product-moment correlation
##
## data: ASD_State_4_SLR$Year and ASD_State_4_SLR$Prevalence
## t = 9.5753, df = 84, p-value = 2.065e-15
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
## 0.6243611 1.0000000
## sample estimates:
## cor
## 0.7224098
SLR Workshop Task: 3. c. Determine the regression equation for the data.
fit_model = lm(formula = Prevalence ~ Year, data = ASD_State_4_SLR)
print(fit_model)
##
## Call:
## lm(formula = Prevalence ~ Year, data = ASD_State_4_SLR)
##
## Coefficients:
## (Intercept) Year
## -1737.8464 0.8714
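With these coefficients, the fitted line is Prevalence ≈ -1737.8464 + 0.8714 × Year, and it can be evaluated directly; a minimal sketch using the rounded coefficients printed above (so results may differ slightly from predict() on the full-precision model):

```r
# Rounded coefficients from the fitted SLR model printed above
b0 <- -1737.8464
b1 <- 0.8714
slr_predict <- function(year) b0 + b1 * year
slr_predict(2010)  # fitted prevalence (per 1,000 children) for 2010, about 13.67
```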
SLR Workshop Task: 4. d. Graph the regression equation and the data points.
plot(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence)
abline(fit_model, col="blue", lwd=2)
SLR Workshop Task: 5. e. Identify potential influential observations (outliers).
# library(repr)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
par(mfrow=c(2, 2))
plot(fit_model)
par(mfrow=c(1, 1))
[ Tips ] We notice:
Points singled out in the Residuals vs Leverage plot (those with large Cook's distance) are the candidate influential observations.
SLR Workshop Task: 6. f. At the 5% significance level, do the data provide sufficient evidence to conclude that the slope of the population regression line is not 0 and, hence, that [ Year ] is useful as a predictor of ASD [ Prevalence ]?
summary(fit_model)
##
## Call:
## lm(formula = Prevalence ~ Year, data = ASD_State_4_SLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9888 -2.7032 -0.0104 1.7397 12.1255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.738e+03 1.827e+02 -9.513 5.51e-15 ***
## Year 8.714e-01 9.101e-02 9.575 4.13e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.688 on 84 degrees of freedom
## Multiple R-squared: 0.5219, Adjusted R-squared: 0.5162
## F-statistic: 91.69 on 1 and 84 DF, p-value: 4.13e-15
[ Tips ] We notice:
The slope's p-value (4.13e-15) is far below the 5% significance level, so we reject the null hypothesis that the population slope is 0; [ Year ] is a useful predictor of ASD [ Prevalence ].
SLR Workshop Task: 7. g. Obtain the residuals and create a residual plot. Decide whether it is reasonable to consider that the assumptions for regression analysis are met by the variables in question.
# library(repr)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
par(mfrow=c(2, 2))
plot(fit_model)
par(mfrow=c(1, 1))
[ Tips ] We notice:
Based on the Residuals vs Fitted, Scale-Location, and Normal Q-Q charts, the residuals broadly follow the linearity assumption, but show a slight “fan shape” at larger Year values (heteroscedasticity). https://statisticsbyjim.com/regression/heteroscedasticity-regression/
We will explore polynomial regression to address this issue later.
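As a preview of that approach, a polynomial term can capture curvature that a straight line misses; a minimal sketch on synthetic data with a fabricated quadratic trend (not the ASD dataset):

```r
# Hedged sketch: quadratic regression on synthetic data with curvature
set.seed(1)
yr   <- rep(2000:2016, each = 5)
prev <- 0.02 * (yr - 2000)^2 + 2 + rnorm(length(yr), sd = 0.5)  # fabricated trend
fit_poly <- lm(prev ~ poly(yr, 2, raw = TRUE))                  # Year and Year^2 terms
summary(fit_poly)$r.squared                                     # quadratic fit captures the curve
```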
SLR Workshop Task: 8. h. Compute and interpret the coefficient of determination, \(R^2\).
summary(fit_model)
##
## Call:
## lm(formula = Prevalence ~ Year, data = ASD_State_4_SLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9888 -2.7032 -0.0104 1.7397 12.1255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.738e+03 1.827e+02 -9.513 5.51e-15 ***
## Year 8.714e-01 9.101e-02 9.575 4.13e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.688 on 84 degrees of freedom
## Multiple R-squared: 0.5219, Adjusted R-squared: 0.5162
## F-statistic: 91.69 on 1 and 84 DF, p-value: 4.13e-15
[ Tips ] We notice:
\(R^2\) is 0.5219
Adjusted \(R^2\) is 0.5162
In other words, Year alone explains roughly 52% of the variation in ASD Prevalence.
SLR Workshop Task: 9. i. Find the predicted ASD Prevalence for a future Year.
future_year = 2025
newdata = data.frame(Year = future_year)
predict(fit_model,newdata)
## 1
## 26.75998
#
cat("Predicted ASD Prevalence of Year [", future_year, "] is", round(predict(fit_model,newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence of Year [ 2025 ] is 26.8 per 1,000 Children
SLR Workshop Task: 10. j. Determine a 95% confidence interval for the predicted ASD Prevalence.
predict(fit_model, newdata, interval = "predict")
## fit lwr upr
## 1 26.75998 18.72351 34.79645
cat("\nPredicted ASD Prevalence of Year [", future_year, "] (95% Upper CI) is",
round(predict(fit_model,newdata, interval = "predict")[3], 1), "per 1,000 Children")
##
## Predicted ASD Prevalence of Year [ 2025 ] (95% Upper CI) is 34.8 per 1,000 Children
cat("\nPredicted ASD Prevalence of Year [", future_year, "] (95% Lower CI) is",
round(predict(fit_model,newdata, interval = "predict")[2], 1), "per 1,000 Children")
##
## Predicted ASD Prevalence of Year [ 2025 ] (95% Lower CI) is 18.7 per 1,000 Children
<h3>
Quiz:
</h3>
<p>
Create a Prevalence ~ Year SLR model for Data Source: SPED
</p>
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
<h3>
Linear Model: Multiple Linear Regression (MLR) - Workshop Task
</h3>
Workshop Task:
MLR Workshop Task: 1. a. Get the data.
Use Case Data: “../dataset/ADV_ASD_State_R.csv”
Read in CSV data, storing as R dataframe
# Read back in above saved file:
# ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
# ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
# ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
head(ASD_State)
## State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1 AZ 45322 6.5 5.8 7.3 2000 addm
## 2 GA 43593 6.5 5.8 7.3 2000 addm
## 3 MD 21532 5.5 4.6 6.6 2000 addm
## 4 NJ 29714 9.9 8.9 11.1 2000 addm
## 5 SC 24535 6.3 5.4 7.4 2000 addm
## 6 WV 23065 4.5 3.7 5.5 2000 addm
## Source_Full1 State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network Arizona
## 2 Autism & Developmental Disabilities Monitoring Network Georgia
## 3 Autism & Developmental Disabilities Monitoring Network Maryland
## 4 Autism & Developmental Disabilities Monitoring Network New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network West Virginia
## State_Full2 Numerator_ASD Numerator_NonASD Proportion
## 1 AZ-Arizona 295 45027 0.006508980
## 2 GA-Georgia 283 43310 0.006491868
## 3 MD-Maryland 118 21414 0.005480215
## 4 NJ-New Jersey 294 29420 0.009894326
## 5 SC-South Carolina 155 24380 0.006317506
## 6 WV-West Virginia 104 22961 0.004508996
## Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1 5.798905 7.303948 9.7
## 2 5.769431 7.302595 11.0
## 3 4.557351 6.583638 8.6
## 4 8.814705 11.102544 14.8
## 5 5.381662 7.411085 9.3
## 6 3.703408 5.483723 6.6
## Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1 8.5 11.1 3.2 2.5 4.0
## 2 9.7 12.4 2.0 1.5 2.7
## 3 7.1 10.6 2.2 1.5 2.7
## 4 13.0 16.8 4.3 3.3 5.5
## 5 7.8 11.2 3.3 2.4 4.5
## 6 5.2 8.2 2.4 1.6 3.5
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1 8.6 7.5
## 2 7.9 6.7
## 3 4.9 3.8
## 4 11.3 9.5
## 5 6.5 5.2
## 6 4.5 3.7
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1 9.8 7.3
## 2 9.3 5.3
## 3 6.4 6.1
## 4 13.3 10.6
## 5 8.2 5.8
## 6 5.5 NA
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1 4.4 12.2 NA
## 2 4.4 6.4 NA
## 3 4.7 8.0 NA
## 4 8.5 13.1 NA
## 5 4.5 7.3 NA
## 6 NA NA NA
## Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## State_Region Source_UC
## 1 D8 Mountain ADDM
## 2 D5 South Atlantic ADDM
## 3 D5 South Atlantic ADDM
## 4 D2 Middle Atlantic ADDM
## 5 D5 South Atlantic ADDM
## 6 D5 South Atlantic ADDM
## Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network Low
## Prevalence_Risk4 Year_Factor
## 1 Medium 2000
## 2 Medium 2000
## 3 Medium 2000
## 4 Medium 2000
## 5 Medium 2000
## 6 Low 2000
names(ASD_State)
## [1] "State"
## [2] "Denominator"
## [3] "Prevalence"
## [4] "Lower.CI"
## [5] "Upper.CI"
## [6] "Year"
## [7] "Source"
## [8] "Source_Full1"
## [9] "State_Full1"
## [10] "State_Full2"
## [11] "Numerator_ASD"
## [12] "Numerator_NonASD"
## [13] "Proportion"
## [14] "Chi_Wilson_Corrected_Lower.CI"
## [15] "Chi_Wilson_Corrected_Upper.CI"
## [16] "Male.Prevalence"
## [17] "Male.Lower.CI"
## [18] "Male.Upper.CI"
## [19] "Female.Prevalence"
## [20] "Female.Lower.CI"
## [21] "Female.Upper.CI"
## [22] "Non.hispanic.white.Prevalence"
## [23] "Non.hispanic.white.Lower.CI"
## [24] "Non.hispanic.white.Upper.CI"
## [25] "Non.hispanic.black.Prevalence"
## [26] "Non.hispanic.black.Lower.CI"
## [27] "Non.hispanic.black.Upper.CI"
## [28] "Hispanic.Prevalence"
## [29] "Hispanic.Lower.CI"
## [30] "Hispanic.Upper.CI"
## [31] "Asian.or.Pacific.Islander.Prevalence"
## [32] "Asian.or.Pacific.Islander.Lower.CI"
## [33] "Asian.or.Pacific.Islander.Upper.CI"
## [34] "State_Region"
## [35] "Source_UC"
## [36] "Source_Full3"
## [37] "Prevalence_Risk2"
## [38] "Prevalence_Risk4"
## [39] "Year_Factor"
# Filter to include relevant columns for MLR:
# Dependent variable: Prevalence
# Independent variables: let's include all for the moment
ASD_State_4_MLR = ASD_State
#
dim(ASD_State_4_MLR)
## [1] 1692 39
head(ASD_State_4_MLR)
## State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1 AZ 45322 6.5 5.8 7.3 2000 addm
## 2 GA 43593 6.5 5.8 7.3 2000 addm
## 3 MD 21532 5.5 4.6 6.6 2000 addm
## 4 NJ 29714 9.9 8.9 11.1 2000 addm
## 5 SC 24535 6.3 5.4 7.4 2000 addm
## 6 WV 23065 4.5 3.7 5.5 2000 addm
## Source_Full1 State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network Arizona
## 2 Autism & Developmental Disabilities Monitoring Network Georgia
## 3 Autism & Developmental Disabilities Monitoring Network Maryland
## 4 Autism & Developmental Disabilities Monitoring Network New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network West Virginia
## State_Full2 Numerator_ASD Numerator_NonASD Proportion
## 1 AZ-Arizona 295 45027 0.006508980
## 2 GA-Georgia 283 43310 0.006491868
## 3 MD-Maryland 118 21414 0.005480215
## 4 NJ-New Jersey 294 29420 0.009894326
## 5 SC-South Carolina 155 24380 0.006317506
## 6 WV-West Virginia 104 22961 0.004508996
## Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1 5.798905 7.303948 9.7
## 2 5.769431 7.302595 11.0
## 3 4.557351 6.583638 8.6
## 4 8.814705 11.102544 14.8
## 5 5.381662 7.411085 9.3
## 6 3.703408 5.483723 6.6
## Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1 8.5 11.1 3.2 2.5 4.0
## 2 9.7 12.4 2.0 1.5 2.7
## 3 7.1 10.6 2.2 1.5 2.7
## 4 13.0 16.8 4.3 3.3 5.5
## 5 7.8 11.2 3.3 2.4 4.5
## 6 5.2 8.2 2.4 1.6 3.5
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1 8.6 7.5
## 2 7.9 6.7
## 3 4.9 3.8
## 4 11.3 9.5
## 5 6.5 5.2
## 6 4.5 3.7
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1 9.8 7.3
## 2 9.3 5.3
## 3 6.4 6.1
## 4 13.3 10.6
## 5 8.2 5.8
## 6 5.5 NA
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1 4.4 12.2 NA
## 2 4.4 6.4 NA
## 3 4.7 8.0 NA
## 4 8.5 13.1 NA
## 5 4.5 7.3 NA
## 6 NA NA NA
## Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## State_Region Source_UC
## 1 D8 Mountain ADDM
## 2 D5 South Atlantic ADDM
## 3 D5 South Atlantic ADDM
## 4 D2 Middle Atlantic ADDM
## 5 D5 South Atlantic ADDM
## 6 D5 South Atlantic ADDM
## Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network Low
## Prevalence_Risk4 Year_Factor
## 1 Medium 2000
## 2 Medium 2000
## 3 Medium 2000
## 4 Medium 2000
## 5 Medium 2000
## 6 Low 2000
MLR Workshop Task: 2. b. Discover and visualize the data to gain insights (Are there missing values in the dataframe? If so, how should they be handled?).
summary(ASD_State_4_MLR)
## State Denominator Prevalence Lower.CI
## AZ : 40 Min. : 965 Min. : 0.400 Min. : 0.30
## MD : 40 1st Qu.: 107151 1st Qu.: 3.100 1st Qu.: 2.80
## GA : 39 Median : 353328 Median : 5.600 Median : 5.30
## MO : 39 Mean : 604689 Mean : 7.191 Mean : 6.42
## NC : 39 3rd Qu.: 767928 3rd Qu.: 9.200 3rd Qu.: 8.60
## WI : 39 Max. :5824922 Max. :42.700 Max. :29.90
## (Other):1456
## Upper.CI Year Source
## Min. : 0.600 Min. :2000 addm: 86
## 1st Qu.: 3.300 1st Qu.:2003 medi:655
## Median : 5.900 Median :2007 nsch: 98
## Mean : 8.262 Mean :2007 sped:853
## 3rd Qu.: 9.700 3rd Qu.:2011
## Max. :69.000 Max. :2016
##
## Source_Full1
## Autism & Developmental Disabilities Monitoring Network: 86
## Medicaid :655
## National Survey of Children's Health : 98
## Special Education Child Count :853
##
##
##
## State_Full1 State_Full2 Numerator_ASD
## Arizona : 40 AZ-Arizona : 40 Min. : 11.0
## Maryland : 40 MD-Maryland : 40 1st Qu.: 463.8
## Georgia : 39 GA-Georgia : 39 Median : 1478.5
## Missouri : 39 MO-Missouri : 39 Mean : 3668.3
## North Carolina: 39 NC-North Carolina: 39 3rd Qu.: 4147.5
## Wisconsin : 39 WI-Wisconsin : 39 Max. :79041.0
## (Other) :1456 (Other) :1456
## Numerator_NonASD Proportion Chi_Wilson_Corrected_Lower.CI
## Min. : 947 Min. :0.0004079 Min. : 0.2647
## 1st Qu.: 106377 1st Qu.:0.0030973 1st Qu.: 2.8266
## Median : 352088 Median :0.0055998 Median : 5.2822
## Mean : 601021 Mean :0.0071888 Mean : 6.5008
## 3rd Qu.: 761045 3rd Qu.:0.0092065 3rd Qu.: 8.6202
## Max. :5795215 Max. :0.0424597 Max. :32.6669
##
## Chi_Wilson_Corrected_Upper.CI Male.Prevalence Male.Lower.CI Male.Upper.CI
## Min. : 0.5491 Min. : 5.00 Min. : 4.100 Min. : 6.20
## 1st Qu.: 3.3186 1st Qu.:11.30 1st Qu.: 9.475 1st Qu.:13.12
## Median : 5.9473 Median :16.90 Median :14.800 Median :19.05
## Mean : 8.0617 Mean :18.28 Mean :16.035 Mean :20.90
## 3rd Qu.: 9.7141 3rd Qu.:23.93 3rd Qu.:21.750 3rd Qu.:26.50
## Max. :55.4426 Max. :45.50 Max. :42.400 Max. :48.90
## NA's :1606 NA's :1606 NA's :1606
## Female.Prevalence Female.Lower.CI Female.Upper.CI
## Min. : 1.000 Min. : 0.600 Min. : 1.700
## 1st Qu.: 2.825 1st Qu.: 1.900 1st Qu.: 3.825
## Median : 3.700 Median : 2.800 Median : 5.200
## Mean : 4.248 Mean : 3.229 Mean : 5.646
## 3rd Qu.: 5.575 3rd Qu.: 4.475 3rd Qu.: 6.875
## Max. :12.300 Max. :10.700 Max. :20.100
## NA's :1606 NA's :1606 NA's :1606
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## Min. : 3.30 Min. : 2.300
## 1st Qu.: 7.45 1st Qu.: 6.275
## Median :12.00 Median :10.250
## Mean :12.43 Mean :10.602
## 3rd Qu.:16.18 3rd Qu.:14.250
## Max. :40.00 Max. :28.900
## NA's :1606 NA's :1606
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## Min. : 4.100 Min. : 1.60
## 1st Qu.: 9.075 1st Qu.: 5.80
## Median :13.650 Median : 9.00
## Mean :14.638 Mean :10.19
## 3rd Qu.:19.025 3rd Qu.:12.80
## Max. :55.500 Max. :27.20
## NA's :1606 NA's :1607
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## Min. : 0.900 Min. : 3.0 Min. : 0.300
## 1st Qu.: 4.100 1st Qu.: 8.0 1st Qu.: 4.500
## Median : 6.000 Median :12.9 Median : 7.000
## Mean : 7.521 Mean :14.6 Mean : 7.891
## 3rd Qu.: 9.800 3rd Qu.:18.4 3rd Qu.:10.100
## Max. :23.300 Max. :80.2 Max. :29.300
## NA's :1607 NA's :1607 NA's :1615
## Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## Min. : 0.000 Min. : 2.10 Min. : 1.000
## 1st Qu.: 2.300 1st Qu.: 9.00 1st Qu.: 4.450
## Median : 4.500 Median :11.40 Median : 8.150
## Mean : 5.462 Mean :12.59 Mean : 9.127
## 3rd Qu.: 7.500 3rd Qu.:14.60 3rd Qu.:12.350
## Max. :26.200 Max. :32.90 Max. :21.900
## NA's :1615 NA's :1615 NA's :1624
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## Min. : 0.200 Min. : 7.40
## 1st Qu.: 1.475 1st Qu.:13.28
## Median : 4.700 Median :18.30
## Mean : 5.235 Mean :18.41
## 3rd Qu.: 7.875 3rd Qu.:22.73
## Max. :16.000 Max. :32.00
## NA's :1624 NA's :1624
## State_Region Source_UC
## D5 South Atlantic :313 ADDM: 86
## D8 Mountain :265 MEDI:655
## D4 West North Central:228 NSCH: 98
## D1 New England :189 SPED:853
## D3 East North Central:167
## D9 Pacific :160
## (Other) :370
## Source_Full3
## ADDM Autism & Developmental Disabilities Monitoring Network: 86
## MEDI Medicaid :655
## NSCH National Survey of Children's Health : 98
## SPED Special Education Child Count :853
##
##
##
## Prevalence_Risk2 Prevalence_Risk4 Year_Factor
## Low :740 Low :740 2012 :154
## High:952 Medium :584 2008 :127
## High :294 2002 :116
## Very High: 74 2004 :111
## 2010 :111
## 2006 :110
## (Other):963
# Count the missing values in each column:
lapply(ASD_State_4_MLR, function(col_x)sum(is.na(col_x)))
## $State
## [1] 0
##
## $Denominator
## [1] 0
##
## $Prevalence
## [1] 0
##
## $Lower.CI
## [1] 0
##
## $Upper.CI
## [1] 0
##
## $Year
## [1] 0
##
## $Source
## [1] 0
##
## $Source_Full1
## [1] 0
##
## $State_Full1
## [1] 0
##
## $State_Full2
## [1] 0
##
## $Numerator_ASD
## [1] 0
##
## $Numerator_NonASD
## [1] 0
##
## $Proportion
## [1] 0
##
## $Chi_Wilson_Corrected_Lower.CI
## [1] 0
##
## $Chi_Wilson_Corrected_Upper.CI
## [1] 0
##
## $Male.Prevalence
## [1] 1606
##
## $Male.Lower.CI
## [1] 1606
##
## $Male.Upper.CI
## [1] 1606
##
## $Female.Prevalence
## [1] 1606
##
## $Female.Lower.CI
## [1] 1606
##
## $Female.Upper.CI
## [1] 1606
##
## $Non.hispanic.white.Prevalence
## [1] 1606
##
## $Non.hispanic.white.Lower.CI
## [1] 1606
##
## $Non.hispanic.white.Upper.CI
## [1] 1606
##
## $Non.hispanic.black.Prevalence
## [1] 1607
##
## $Non.hispanic.black.Lower.CI
## [1] 1607
##
## $Non.hispanic.black.Upper.CI
## [1] 1607
##
## $Hispanic.Prevalence
## [1] 1615
##
## $Hispanic.Lower.CI
## [1] 1615
##
## $Hispanic.Upper.CI
## [1] 1615
##
## $Asian.or.Pacific.Islander.Prevalence
## [1] 1624
##
## $Asian.or.Pacific.Islander.Lower.CI
## [1] 1624
##
## $Asian.or.Pacific.Islander.Upper.CI
## [1] 1624
##
## $State_Region
## [1] 0
##
## $Source_UC
## [1] 0
##
## $Source_Full3
## [1] 0
##
## $Prevalence_Risk2
## [1] 0
##
## $Prevalence_Risk4
## [1] 0
##
## $Year_Factor
## [1] 0
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=3)
barplot(apply(ASD_State_4_MLR, 2, function(col_x)sum(is.na(col_x))))
dim(ASD_State_4_MLR)
## [1] 1692 39
# Get all columns that contain missing values
NA_Column_Names <- names(ASD_State_4_MLR[0, colSums(is.na(ASD_State_4_MLR)) > 0])
#
NA_Column_Names
## [1] "Male.Prevalence"
## [2] "Male.Lower.CI"
## [3] "Male.Upper.CI"
## [4] "Female.Prevalence"
## [5] "Female.Lower.CI"
## [6] "Female.Upper.CI"
## [7] "Non.hispanic.white.Prevalence"
## [8] "Non.hispanic.white.Lower.CI"
## [9] "Non.hispanic.white.Upper.CI"
## [10] "Non.hispanic.black.Prevalence"
## [11] "Non.hispanic.black.Lower.CI"
## [12] "Non.hispanic.black.Upper.CI"
## [13] "Hispanic.Prevalence"
## [14] "Hispanic.Lower.CI"
## [15] "Hispanic.Upper.CI"
## [16] "Asian.or.Pacific.Islander.Prevalence"
## [17] "Asian.or.Pacific.Islander.Lower.CI"
## [18] "Asian.or.Pacific.Islander.Upper.CI"
# Remove these columns from dataframe
ASD_State_4_MLR <- ASD_State_4_MLR[ , !(names(ASD_State_4_MLR) %in% NA_Column_Names)]
#
head(ASD_State_4_MLR)
## State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1 AZ 45322 6.5 5.8 7.3 2000 addm
## 2 GA 43593 6.5 5.8 7.3 2000 addm
## 3 MD 21532 5.5 4.6 6.6 2000 addm
## 4 NJ 29714 9.9 8.9 11.1 2000 addm
## 5 SC 24535 6.3 5.4 7.4 2000 addm
## 6 WV 23065 4.5 3.7 5.5 2000 addm
## Source_Full1 State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network Arizona
## 2 Autism & Developmental Disabilities Monitoring Network Georgia
## 3 Autism & Developmental Disabilities Monitoring Network Maryland
## 4 Autism & Developmental Disabilities Monitoring Network New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network West Virginia
## State_Full2 Numerator_ASD Numerator_NonASD Proportion
## 1 AZ-Arizona 295 45027 0.006508980
## 2 GA-Georgia 283 43310 0.006491868
## 3 MD-Maryland 118 21414 0.005480215
## 4 NJ-New Jersey 294 29420 0.009894326
## 5 SC-South Carolina 155 24380 0.006317506
## 6 WV-West Virginia 104 22961 0.004508996
## Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI
## 1 5.798905 7.303948
## 2 5.769431 7.302595
## 3 4.557351 6.583638
## 4 8.814705 11.102544
## 5 5.381662 7.411085
## 6 3.703408 5.483723
## State_Region Source_UC
## 1 D8 Mountain ADDM
## 2 D5 South Atlantic ADDM
## 3 D5 South Atlantic ADDM
## 4 D2 Middle Atlantic ADDM
## 5 D5 South Atlantic ADDM
## 6 D5 South Atlantic ADDM
## Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network Low
## Prevalence_Risk4 Year_Factor
## 1 Medium 2000
## 2 Medium 2000
## 3 Medium 2000
## 4 Medium 2000
## 5 Medium 2000
## 6 Low 2000
No missing values remain, as the affected columns were removed above. Hurrah!
But some variables contain “leaky” information that can be used to directly calculate the dependent variable, Prevalence. Such information would not be available in a real-world prediction scenario, so these columns need to be removed.
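For example, Prevalence is (up to rounding) just Proportion × 1,000; a quick check using the values shown in head(ASD_State) earlier:

```r
# 'Proportion' leaks the target: Prevalence is Proportion * 1000, rounded to 1 d.p.
proportion <- c(0.006508980, 0.006491868, 0.005480215)  # first three rows shown above
prevalence <- c(6.5, 6.5, 5.5)
round(proportion * 1000, 1)  # reproduces Prevalence exactly
```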
cbind(names(ASD_State_4_MLR), c(1:length(names(ASD_State_4_MLR))))
## [,1] [,2]
## [1,] "State" "1"
## [2,] "Denominator" "2"
## [3,] "Prevalence" "3"
## [4,] "Lower.CI" "4"
## [5,] "Upper.CI" "5"
## [6,] "Year" "6"
## [7,] "Source" "7"
## [8,] "Source_Full1" "8"
## [9,] "State_Full1" "9"
## [10,] "State_Full2" "10"
## [11,] "Numerator_ASD" "11"
## [12,] "Numerator_NonASD" "12"
## [13,] "Proportion" "13"
## [14,] "Chi_Wilson_Corrected_Lower.CI" "14"
## [15,] "Chi_Wilson_Corrected_Upper.CI" "15"
## [16,] "State_Region" "16"
## [17,] "Source_UC" "17"
## [18,] "Source_Full3" "18"
## [19,] "Prevalence_Risk2" "19"
## [20,] "Prevalence_Risk4" "20"
## [21,] "Year_Factor" "21"
Leaky_Column_Names = c('Lower.CI', 'Upper.CI', 'Numerator_ASD', 'Numerator_NonASD', 'Proportion',
'Chi_Wilson_Corrected_Lower.CI', 'Chi_Wilson_Corrected_Upper.CI',
'Prevalence_Risk2', 'Prevalence_Risk4')
# Remove these columns from dataframe
ASD_State_4_MLR <- ASD_State_4_MLR[ , !(names(ASD_State_4_MLR) %in% Leaky_Column_Names)]
#
head(ASD_State_4_MLR)
## State Denominator Prevalence Year Source
## 1 AZ 45322 6.5 2000 addm
## 2 GA 43593 6.5 2000 addm
## 3 MD 21532 5.5 2000 addm
## 4 NJ 29714 9.9 2000 addm
## 5 SC 24535 6.3 2000 addm
## 6 WV 23065 4.5 2000 addm
## Source_Full1 State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network Arizona
## 2 Autism & Developmental Disabilities Monitoring Network Georgia
## 3 Autism & Developmental Disabilities Monitoring Network Maryland
## 4 Autism & Developmental Disabilities Monitoring Network New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network West Virginia
## State_Full2 State_Region Source_UC
## 1 AZ-Arizona D8 Mountain ADDM
## 2 GA-Georgia D5 South Atlantic ADDM
## 3 MD-Maryland D5 South Atlantic ADDM
## 4 NJ-New Jersey D2 Middle Atlantic ADDM
## 5 SC-South Carolina D5 South Atlantic ADDM
## 6 WV-West Virginia D5 South Atlantic ADDM
## Source_Full3 Year_Factor
## 1 ADDM Autism & Developmental Disabilities Monitoring Network 2000
## 2 ADDM Autism & Developmental Disabilities Monitoring Network 2000
## 3 ADDM Autism & Developmental Disabilities Monitoring Network 2000
## 4 ADDM Autism & Developmental Disabilities Monitoring Network 2000
## 5 ADDM Autism & Developmental Disabilities Monitoring Network 2000
## 6 ADDM Autism & Developmental Disabilities Monitoring Network 2000
Remove redundant/duplicate variables (which would produce aliased coefficients); retaining one variable for each type of information is enough:
https://en.wikipedia.org/wiki/Multicollinearity
https://stats.stackexchange.com/questions/112442/what-are-aliased-coefficients
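To see what an aliased coefficient looks like, here is a minimal sketch on made-up data (not the ASD dataset): a predictor that is an exact linear function of another gets dropped by lm(), and alias() reports the dependency.

```r
# Hedged sketch: a perfectly collinear predictor gets an NA (aliased) coefficient
set.seed(42)
df <- data.frame(x1 = 1:10)
df$x2 <- 2 * df$x1                      # x2 is an exact linear function of x1
df$y  <- 3 + df$x1 + rnorm(10)
fit <- lm(y ~ x1 + x2, data = df)
coef(fit)["x2"]                         # NA: lm drops the aliased term
alias(fit)                              # reports the x2 = 2 * x1 dependency
```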
Redundant_Column_Names = c('State', 'Source_Full1', 'State_Full1', 'State_Region', 'Source_UC', 'Source_Full3', 'Year_Factor')
# Remove these columns from dataframe
ASD_State_4_MLR <- ASD_State_4_MLR[ , !(names(ASD_State_4_MLR) %in% Redundant_Column_Names)]
#
head(ASD_State_4_MLR)
## Denominator Prevalence Year Source State_Full2
## 1 45322 6.5 2000 addm AZ-Arizona
## 2 43593 6.5 2000 addm GA-Georgia
## 3 21532 5.5 2000 addm MD-Maryland
## 4 29714 9.9 2000 addm NJ-New Jersey
## 5 24535 6.3 2000 addm SC-South Carolina
## 6 23065 4.5 2000 addm WV-West Virginia
MLR Workshop Task: 3. c. Visualize the data to gain insights
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
plot(ASD_State_4_MLR$Year, ASD_State_4_MLR$Prevalence)
plot(as.factor(ASD_State_4_MLR$Year), ASD_State_4_MLR$Prevalence)
plot(ASD_State_4_MLR$Denominator, ASD_State_4_MLR$Prevalence)
# To use bin() function
# https://www.rdocumentation.org/packages/OneR/versions/2.2/topics/bin
if(!require(OneR)){install.packages("OneR")}
## Loading required package: OneR
library('OneR')
# Bin 'Denominator'
plot(bin(ASD_State_4_MLR$Denominator, nbins = 10), ASD_State_4_MLR$Prevalence)
plot(ASD_State_4_MLR$Source, ASD_State_4_MLR$Prevalence)
plot(ASD_State_4_MLR$State_Full2, ASD_State_4_MLR$Prevalence)
MLR Workshop Task: 4. d. Compute correlation between variables and apply multiple regression.
Recode categorical variable to dummy (numeric) variable using one-hot encoding:
# To use select_if() function
if(!require(dplyr)){install.packages("dplyr")}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("dplyr")
summary(select_if(ASD_State_4_MLR, is.numeric))
## Denominator Prevalence Year
## Min. : 965 Min. : 0.400 Min. :2000
## 1st Qu.: 107151 1st Qu.: 3.100 1st Qu.:2003
## Median : 353328 Median : 5.600 Median :2007
## Mean : 604689 Mean : 7.191 Mean :2007
## 3rd Qu.: 767928 3rd Qu.: 9.200 3rd Qu.:2011
## Max. :5824922 Max. :42.700 Max. :2016
correlation = cor(select_if(ASD_State_4_MLR, is.numeric))
correlation
## Denominator Prevalence Year
## Denominator 1.00000000 -0.1374662 0.02851671
## Prevalence -0.13746621 1.0000000 0.64002950
## Year 0.02851671 0.6400295 1.00000000
# Each variable's correlation with the target (dependent) variable:
correlation[, 2]
## Denominator Prevalence Year
## -0.1374662 1.0000000 0.6400295
if(!require(corrplot)){install.packages("corrplot")}
## Loading required package: corrplot
## corrplot 0.84 loaded
library("corrplot")
corrplot(correlation, tl.col="black", tl.pos = "lt")
str(ASD_State_4_MLR)
## 'data.frame': 1692 obs. of 5 variables:
## $ Denominator: int 45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
## $ Prevalence : num 6.5 6.5 5.5 9.9 6.3 4.5 3.3 6.2 6.9 5.9 ...
## $ Year : int 2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
## $ Source : Factor w/ 4 levels "addm","medi",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ State_Full2: Factor w/ 51 levels "AK-Alaska","AL-Alabama",..: 4 11 21 32 41 50 2 4 3 6 ...
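The factor columns shown by str() above are the "one-hot" part: lm() expands each factor into dummy columns automatically via model.matrix(), with the first level absorbed into the intercept. A minimal sketch on a hypothetical toy factor (illustration only, not workshop data):

```r
# model.matrix() shows the dummy (one-hot) columns lm() builds from a factor.
# Toy factor for illustration only (hypothetical data):
toy <- data.frame(Source = factor(c("addm", "medi", "nsch", "sped")))
model.matrix(~ Source, data = toy)
# Columns: (Intercept), Sourcemedi, Sourcensch, Sourcesped --
# "addm" is the reference level, absorbed into the intercept.
```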
# To build a (national-level) ASD Prevalence predictive model covering all states:
# When the US state name is not known, we can fit a model without the state name/code:
fit_model = lm(Prevalence ~ . - State_Full2, data = ASD_State_4_MLR) # Exclude a variable using: "- variable"
#
summary(fit_model)
##
## Call:
## lm(formula = Prevalence ~ . - State_Full2, data = ASD_State_4_MLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4476 -1.9332 -0.2786 1.2479 21.0130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.310e+03 3.819e+01 -34.306 <2e-16 ***
## Denominator -1.139e-07 1.057e-07 -1.078 0.281
## Year 6.583e-01 1.902e-02 34.606 <2e-16 ***
## Sourcemedi -4.550e+00 3.964e-01 -11.478 <2e-16 ***
## Sourcensch 6.699e+00 5.171e-01 12.956 <2e-16 ***
## Sourcesped -5.611e+00 3.984e-01 -14.085 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 1686 degrees of freedom
## Multiple R-squared: 0.6585, Adjusted R-squared: 0.6575
## F-statistic: 650.2 on 5 and 1686 DF, p-value: < 2.2e-16
<p>
Adjusted $R^2$ = 0.6575
</p>
# To build a (US state-level) ASD Prevalence predictive model for a specific state's prevalence:
# When the US state name is known. (A state name is required during prediction.)
fit_model = lm(Prevalence ~ . , data = ASD_State_4_MLR) # "~." means all other variables, including factors
#
summary(fit_model)
##
## Call:
## lm(formula = Prevalence ~ ., data = ASD_State_4_MLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3326 -1.3626 -0.0689 1.2558 19.0273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.305e+03 3.144e+01 -41.498 < 2e-16 ***
## Denominator 1.152e-06 2.252e-07 5.115 3.50e-07 ***
## Year 6.558e-01 1.566e-02 41.889 < 2e-16 ***
## Sourcemedi -4.997e+00 3.557e-01 -14.048 < 2e-16 ***
## Sourcensch 6.100e+00 4.404e-01 13.853 < 2e-16 ***
## Sourcesped -6.699e+00 3.915e-01 -17.110 < 2e-16 ***
## State_Full2AL-Alabama 1.377e+00 6.917e-01 1.990 0.046725 *
## State_Full2AR-Arkansas -8.466e-01 6.838e-01 -1.238 0.215831
## State_Full2AZ-Arizona 1.592e-01 6.841e-01 0.233 0.816057
## State_Full2CA-California -5.092e+00 1.188e+00 -4.284 1.94e-05 ***
## State_Full2CO-Colorado -2.565e+00 6.875e-01 -3.732 0.000197 ***
## State_Full2CT-Connecticut 3.031e-01 7.004e-01 0.433 0.665205
## State_Full2DC-District of Columbia -1.383e+00 7.510e-01 -1.841 0.065799 .
## State_Full2DE-Delaware 5.313e-01 7.030e-01 0.756 0.449933
## State_Full2FL-Florida -2.829e+00 7.793e-01 -3.630 0.000292 ***
## State_Full2GA-Georgia -1.024e+00 7.045e-01 -1.454 0.146273
## State_Full2HI-Hawaii -2.149e+00 7.087e-01 -3.032 0.002467 **
## State_Full2IA-Iowa -2.160e+00 7.046e-01 -3.066 0.002206 **
## State_Full2ID-Idaho 2.484e+00 7.034e-01 3.532 0.000424 ***
## State_Full2IL-Illinois -1.648e+00 7.613e-01 -2.164 0.030575 *
## State_Full2IN-Indiana 1.420e+00 7.103e-01 2.000 0.045713 *
## State_Full2KS-Kansas -8.059e-01 7.159e-01 -1.126 0.260465
## State_Full2KY-Kentucky -8.268e-01 7.077e-01 -1.168 0.242887
## State_Full2LA-Louisiana -2.557e+00 7.104e-01 -3.600 0.000328 ***
## State_Full2MA-Massachusetts 7.401e-01 7.074e-01 1.046 0.295625
## State_Full2MD-Maryland 2.673e-01 6.792e-01 0.394 0.693986
## State_Full2ME-Maine 6.435e+00 7.353e-01 8.752 < 2e-16 ***
## State_Full2MI-Michigan 3.667e-01 7.380e-01 0.497 0.619395
## State_Full2MN-Minnesota 6.592e+00 6.997e-01 9.422 < 2e-16 ***
## State_Full2MO-Missouri 1.443e-01 6.842e-01 0.211 0.832955
## State_Full2MS-Mississippi -2.222e+00 7.170e-01 -3.099 0.001972 **
## State_Full2MT-Montana 1.431e+00 7.210e-01 1.984 0.047367 *
## State_Full2NC-North Carolina -2.705e-01 6.979e-01 -0.388 0.698328
## State_Full2ND-North Dakota 7.056e-01 7.088e-01 0.996 0.319629
## State_Full2NE-Nebraska -1.845e+00 7.090e-01 -2.602 0.009349 **
## State_Full2NH-New Hampshire 2.265e+00 6.978e-01 3.245 0.001197 **
## State_Full2NJ-New Jersey 1.871e+00 6.945e-01 2.694 0.007140 **
## State_Full2NM-New Mexico -2.450e+00 7.156e-01 -3.424 0.000633 ***
## State_Full2NV-Nevada -5.725e-01 7.040e-01 -0.813 0.416221
## State_Full2NY-New York -2.234e+00 7.963e-01 -2.805 0.005089 **
## State_Full2OH-Ohio -1.097e+00 7.532e-01 -1.456 0.145501
## State_Full2OK-Oklahoma -1.934e+00 7.023e-01 -2.753 0.005965 **
## State_Full2OR-Oregon 3.188e+00 6.957e-01 4.582 4.96e-06 ***
## State_Full2PA-Pennsylvania 1.478e-01 7.215e-01 0.205 0.837668
## State_Full2RI-Rhode Island 4.109e+00 6.978e-01 5.889 4.70e-09 ***
## State_Full2SC-South Carolina -1.590e+00 6.839e-01 -2.326 0.020154 *
## State_Full2SD-South Dakota -1.503e+00 7.087e-01 -2.121 0.034061 *
## State_Full2TN-Tennessee -2.489e+00 7.100e-01 -3.506 0.000467 ***
## State_Full2TX-Texas -4.918e+00 9.844e-01 -4.996 6.48e-07 ***
## State_Full2UT-Utah -5.550e-01 6.867e-01 -0.808 0.419081
## State_Full2VA-Virginia 7.937e-02 7.182e-01 0.111 0.912020
## State_Full2VT-Vermont 3.614e+00 7.147e-01 5.057 4.74e-07 ***
## State_Full2WA-Washington -1.732e+00 7.168e-01 -2.416 0.015804 *
## State_Full2WI-Wisconsin -2.749e-01 6.821e-01 -0.403 0.686973
## State_Full2WV-West Virginia 5.187e-01 6.935e-01 0.748 0.454664
## State_Full2WY-Wyoming -6.109e-01 7.210e-01 -0.847 0.396968
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.812 on 1636 degrees of freedom
## Multiple R-squared: 0.7772, Adjusted R-squared: 0.7697
## F-statistic: 103.8 on 55 and 1636 DF, p-value: < 2.2e-16
<p>
Adjusted $R^2$ = 0.7697
</p>
MLR Workshop Task: 5. e. Check for multicollinearity, then remove it.
< Detection of multicollinearity >
Some authors suggest a formal detection tolerance or the variance inflation factor (VIF) to detect multicollinearity: a VIF of 5 to 10 or above indicates a multicollinearity problem.
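The VIF can also be computed by hand: regress one predictor on all the others and take 1 / (1 − R²). A sketch on hypothetical toy data (the names x1, x2, x3 are made up for illustration):

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the remaining predictors. Toy data (hypothetical):
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(100)
r2_x2  <- summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 <- 1 / (1 - r2_x2)
vif_x2  # well above 10, flagging the x1/x2 collinearity
```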
# To use vif() function
if(!require(car)){install.packages("car")}
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library("car")
vif(fit_model)
## GVIF Df GVIF^(1/(2*Df))
## Denominator 7.737921 1 2.781712
## Year 1.146987 1 1.070975
## Source 2.502245 3 1.165167
## State_Full2 7.768262 50 1.020712
[ Tips ] We notice that the VIF values of Denominator and State_Full2 are high. Let’s exclude them one at a time.
Retain: State; Remove: Denominator; then re-build the model:
# To build a (US state-level) ASD Prevalence predictive model retaining State_Full2:
# Denominator is dropped because of its high VIF:
fit_model_with_State = lm(Prevalence ~ . - Denominator, data = ASD_State_4_MLR) # Exclude a variable using: "- variable"
#
summary(fit_model_with_State)
##
## Call:
## lm(formula = Prevalence ~ . - Denominator, data = ASD_State_4_MLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7769 -1.3853 -0.0184 1.2957 19.2545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.319e+03 3.157e+01 -41.774 < 2e-16 ***
## Year 6.624e-01 1.572e-02 42.130 < 2e-16 ***
## Sourcemedi -4.503e+00 3.450e-01 -13.053 < 2e-16 ***
## Sourcensch 6.149e+00 4.436e-01 13.861 < 2e-16 ***
## Sourcesped -5.685e+00 3.403e-01 -16.708 < 2e-16 ***
## State_Full2AL-Alabama 1.875e+00 6.901e-01 2.717 0.006662 **
## State_Full2AR-Arkansas -4.970e-01 6.855e-01 -0.725 0.468547
## State_Full2AZ-Arizona 8.608e-01 6.754e-01 1.275 0.202621
## State_Full2CA-California -2.119e-01 7.141e-01 -0.297 0.766731
## State_Full2CO-Colorado -2.086e+00 6.862e-01 -3.039 0.002410 **
## State_Full2CT-Connecticut 6.142e-01 7.031e-01 0.874 0.382493
## State_Full2DC-District of Columbia -1.414e+00 7.567e-01 -1.869 0.061793 .
## State_Full2DE-Delaware 5.281e-01 7.084e-01 0.746 0.456070
## State_Full2FL-Florida -1.004e+00 6.982e-01 -1.438 0.150554
## State_Full2GA-Georgia 2.090e-02 6.794e-01 0.031 0.975460
## State_Full2HI-Hawaii -2.126e+00 7.141e-01 -2.977 0.002949 **
## State_Full2IA-Iowa -1.919e+00 7.084e-01 -2.709 0.006828 **
## State_Full2ID-Idaho 2.595e+00 7.085e-01 3.663 0.000258 ***
## State_Full2IL-Illinois -1.531e-01 7.084e-01 -0.216 0.828894
## State_Full2IN-Indiana 2.102e+00 7.031e-01 2.990 0.002833 **
## State_Full2KS-Kansas -5.955e-01 7.202e-01 -0.827 0.408469
## State_Full2KY-Kentucky -4.094e-01 7.084e-01 -0.578 0.563423
## State_Full2LA-Louisiana -2.034e+00 7.084e-01 -2.872 0.004134 **
## State_Full2MA-Massachusetts 1.337e+00 7.031e-01 1.901 0.057418 .
## State_Full2MD-Maryland 8.308e-01 6.754e-01 1.230 0.218796
## State_Full2ME-Maine 6.458e+00 7.409e-01 8.716 < 2e-16 ***
## State_Full2MI-Michigan 1.516e+00 7.084e-01 2.139 0.032544 *
## State_Full2MN-Minnesota 7.094e+00 6.980e-01 10.164 < 2e-16 ***
## State_Full2MO-Missouri 7.645e-01 6.785e-01 1.127 0.259995
## State_Full2MS-Mississippi -1.940e+00 7.204e-01 -2.692 0.007164 **
## State_Full2MT-Montana 1.455e+00 7.265e-01 2.002 0.045447 *
## State_Full2NC-North Carolina 6.671e-01 6.785e-01 0.983 0.325658
## State_Full2ND-North Dakota 6.543e-01 7.141e-01 0.916 0.359735
## State_Full2NE-Nebraska -1.744e+00 7.141e-01 -2.442 0.014701 *
## State_Full2NH-New Hampshire 2.320e+00 7.031e-01 3.300 0.000987 ***
## State_Full2NJ-New Jersey 2.669e+00 6.819e-01 3.914 9.44e-05 ***
## State_Full2NM-New Mexico -2.283e+00 7.204e-01 -3.169 0.001557 **
## State_Full2NV-Nevada -3.875e-01 7.084e-01 -0.547 0.584423
## State_Full2NY-New York -2.707e-01 7.031e-01 -0.385 0.700290
## State_Full2OH-Ohio 2.075e-01 7.141e-01 0.291 0.771469
## State_Full2OK-Oklahoma -1.525e+00 7.031e-01 -2.169 0.030198 *
## State_Full2OR-Oregon 3.511e+00 6.981e-01 5.030 5.46e-07 ***
## State_Full2PA-Pennsylvania 1.324e+00 6.891e-01 1.921 0.054844 .
## State_Full2RI-Rhode Island 4.144e+00 7.031e-01 5.895 4.55e-09 ***
## State_Full2SC-South Carolina -1.087e+00 6.819e-01 -1.594 0.111164
## State_Full2SD-South Dakota -1.525e+00 7.141e-01 -2.135 0.032901 *
## State_Full2TN-Tennessee -1.817e+00 7.031e-01 -2.584 0.009849 **
## State_Full2TX-Texas -1.456e+00 7.204e-01 -2.022 0.043387 *
## State_Full2UT-Utah -2.537e-01 6.894e-01 -0.368 0.712949
## State_Full2VA-Virginia 8.313e-01 7.084e-01 1.173 0.240803
## State_Full2VT-Vermont 3.604e+00 7.201e-01 5.005 6.18e-07 ***
## State_Full2WA-Washington -1.016e+00 7.084e-01 -1.434 0.151848
## State_Full2WI-Wisconsin 2.790e-01 6.786e-01 0.411 0.680989
## State_Full2WV-West Virginia 6.708e-01 6.982e-01 0.961 0.336837
## State_Full2WY-Wyoming -6.188e-01 7.265e-01 -0.852 0.394499
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.834 on 1637 degrees of freedom
## Multiple R-squared: 0.7736, Adjusted R-squared: 0.7662
## F-statistic: 103.6 on 54 and 1637 DF, p-value: < 2.2e-16
vif(fit_model_with_State)
## GVIF Df GVIF^(1/(2*Df))
## Year 1.139247 1 1.067355
## Source 1.306350 3 1.045546
## State_Full2 1.149997 50 1.001399
<p>
Adjusted $R^2$ = 0.7662
</p>
Retain: Denominator; Remove: State; then re-build the model:
# To build a (national-level) ASD Prevalence predictive model covering all states:
# When the US state name is not known, we can fit a model without the state name/code:
fit_model_with_Denominator = lm(Prevalence ~ . - State_Full2, data = ASD_State_4_MLR) # Exclude a variable using: "- variable"
#
summary(fit_model_with_Denominator)
##
## Call:
## lm(formula = Prevalence ~ . - State_Full2, data = ASD_State_4_MLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4476 -1.9332 -0.2786 1.2479 21.0130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.310e+03 3.819e+01 -34.306 <2e-16 ***
## Denominator -1.139e-07 1.057e-07 -1.078 0.281
## Year 6.583e-01 1.902e-02 34.606 <2e-16 ***
## Sourcemedi -4.550e+00 3.964e-01 -11.478 <2e-16 ***
## Sourcensch 6.699e+00 5.171e-01 12.956 <2e-16 ***
## Sourcesped -5.611e+00 3.984e-01 -14.085 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 1686 degrees of freedom
## Multiple R-squared: 0.6585, Adjusted R-squared: 0.6575
## F-statistic: 650.2 on 5 and 1686 DF, p-value: < 2.2e-16
vif(fit_model_with_Denominator)
## GVIF Df GVIF^(1/(2*Df))
## Denominator 1.145505 1 1.070283
## Year 1.138674 1 1.067086
## Source 1.301973 3 1.044962
<p>
Adjusted $R^2$ = 0.6575
</p>
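The two reduced models are not nested, so an F-test does not apply; AIC (lower is better) is one way to compare them alongside adjusted R². A sketch on hypothetical toy data (in the workshop, the same AIC() call works on fit_model_with_State and fit_model_with_Denominator):

```r
# AIC() compares non-nested lm fits; lower AIC is better.
# Toy data (hypothetical):
set.seed(2)
d   <- data.frame(a = rnorm(50), b = rnorm(50))
d$y <- 2 * d$a + rnorm(50)
m_a <- lm(y ~ a, data = d)  # uses the informative predictor
m_b <- lm(y ~ b, data = d)  # uses the noise predictor
AIC(m_a, m_b)               # m_a should show the lower AIC
```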
MLR Workshop Task: 6. f. What does your final model look like?
During prediction, if the US state name will be known, then fit_model_with_State is the better choice because it has the higher adjusted \(R^2\) value.
During prediction, if the US state name will NOT be known, then fit_model_with_Denominator can be adopted because it does not require the state name as input for prediction.
MLR Prediction 1
Let’s use fit_model_with_State to predict AZ-Arizona’s ASD prevalence for Year 2016, had ADDM conducted a survey:
newdata = ASD_State_4_MLR[1,] # Copy datastructure
newdata$Prevalence = NA
newdata$Denominator = 50000
newdata$Year = 2016
newdata$Source = "addm"
#newdata$State_Full2 = "CA-California"
newdata$State_Full2 = "AZ-Arizona"
newdata
## Denominator Prevalence Year Source State_Full2
## 1 50000 NA 2016 addm AZ-Arizona
predict(fit_model_with_State, newdata, interval = "predict")
## fit lwr upr
## 1 17.54292 11.88535 23.20049
#
cat("Predicted ASD Prevalence is", round(predict(fit_model_with_State, newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence is 17.5 per 1,000 Children
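A note on interval = "predict": it returns a prediction interval, which covers a single future observation and is always wider than the confidence interval for the mean response. A sketch on hypothetical toy data:

```r
# Prediction intervals (new observation) are wider than confidence
# intervals (mean response). Toy data (hypothetical):
set.seed(3)
d   <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)
m   <- lm(y ~ x, data = d)
nd  <- data.frame(x = 25)
predict(m, nd, interval = "prediction")
predict(m, nd, interval = "confidence")  # narrower than the prediction interval
```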
MLR Prediction 2
Let’s use fit_model_with_Denominator to predict the national-level ASD prevalence for Year 2016, had ADDM conducted a survey:
predict(fit_model_with_Denominator, newdata, interval = "predict")
## fit lwr upr
## 1 17.07629 10.30306 23.84952
#
cat("Predicted ASD Prevalence is", round(predict(fit_model_with_Denominator, newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence is 17.1 per 1,000 Children
MLR Prediction 3
Let’s use fit_model to predict FL-Florida’s state-level ASD prevalence for Year 2025, assuming SPED will conduct a record review/survey of 2,600,000 children.
newdata = ASD_State_4_MLR[1,] # Copy datastructure
newdata$Prevalence = NA
newdata$Denominator = 2600000
newdata$Year = 2025
newdata$Source = "sped"
newdata$State_Full2 = "FL-Florida"
newdata
## Denominator Prevalence Year Source State_Full2
## 1 2600000 NA 2025 sped FL-Florida
predict(fit_model, newdata, interval = "predict")
## fit lwr upr
## 1 16.63773 11.00985 22.2656
#
cat("Predicted ASD Prevalence is", round(predict(fit_model, newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence is 16.6 per 1,000 Children
<a href="">
<img src="" width="750" align="center">
</a>
<h3>
Linear Model: Polynomial (Linear) Regression (PLR) - Workshop Task
</h3>
Workshop Task:
PLR Workshop Task: 1. a. Get the data.
Use Case Data: “../dataset/ADV_ASD_State_R.csv”
Read in CSV data, storing as R dataframe
# Read back in above saved file:
ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
head(ASD_State)
## State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1 AZ 45322 6.5 5.8 7.3 2000 addm
## 2 GA 43593 6.5 5.8 7.3 2000 addm
## 3 MD 21532 5.5 4.6 6.6 2000 addm
## 4 NJ 29714 9.9 8.9 11.1 2000 addm
## 5 SC 24535 6.3 5.4 7.4 2000 addm
## 6 WV 23065 4.5 3.7 5.5 2000 addm
## Source_Full1 State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network Arizona
## 2 Autism & Developmental Disabilities Monitoring Network Georgia
## 3 Autism & Developmental Disabilities Monitoring Network Maryland
## 4 Autism & Developmental Disabilities Monitoring Network New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network West Virginia
## State_Full2 Numerator_ASD Numerator_NonASD Proportion
## 1 AZ-Arizona 295 45027 0.006508980
## 2 GA-Georgia 283 43310 0.006491868
## 3 MD-Maryland 118 21414 0.005480215
## 4 NJ-New Jersey 294 29420 0.009894326
## 5 SC-South Carolina 155 24380 0.006317506
## 6 WV-West Virginia 104 22961 0.004508996
## Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1 5.798905 7.303948 9.7
## 2 5.769431 7.302595 11.0
## 3 4.557351 6.583638 8.6
## 4 8.814705 11.102544 14.8
## 5 5.381662 7.411085 9.3
## 6 3.703408 5.483723 6.6
## Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1 8.5 11.1 3.2 2.5 4.0
## 2 9.7 12.4 2.0 1.5 2.7
## 3 7.1 10.6 2.2 1.5 2.7
## 4 13.0 16.8 4.3 3.3 5.5
## 5 7.8 11.2 3.3 2.4 4.5
## 6 5.2 8.2 2.4 1.6 3.5
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1 8.6 7.5
## 2 7.9 6.7
## 3 4.9 3.8
## 4 11.3 9.5
## 5 6.5 5.2
## 6 4.5 3.7
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1 9.8 7.3
## 2 9.3 5.3
## 3 6.4 6.1
## 4 13.3 10.6
## 5 8.2 5.8
## 6 5.5 NA
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1 4.4 12.2 NA
## 2 4.4 6.4 NA
## 3 4.7 8.0 NA
## 4 8.5 13.1 NA
## 5 4.5 7.3 NA
## 6 NA NA NA
## Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## State_Region Source_UC
## 1 D8 Mountain ADDM
## 2 D5 South Atlantic ADDM
## 3 D5 South Atlantic ADDM
## 4 D2 Middle Atlantic ADDM
## 5 D5 South Atlantic ADDM
## 6 D5 South Atlantic ADDM
## Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network Low
## Prevalence_Risk4 Year_Factor
## 1 Medium 2000
## 2 Medium 2000
## 3 Medium 2000
## 4 Medium 2000
## 5 Medium 2000
## 6 Low 2000
names(ASD_State)
## [1] "State"
## [2] "Denominator"
## [3] "Prevalence"
## [4] "Lower.CI"
## [5] "Upper.CI"
## [6] "Year"
## [7] "Source"
## [8] "Source_Full1"
## [9] "State_Full1"
## [10] "State_Full2"
## [11] "Numerator_ASD"
## [12] "Numerator_NonASD"
## [13] "Proportion"
## [14] "Chi_Wilson_Corrected_Lower.CI"
## [15] "Chi_Wilson_Corrected_Upper.CI"
## [16] "Male.Prevalence"
## [17] "Male.Lower.CI"
## [18] "Male.Upper.CI"
## [19] "Female.Prevalence"
## [20] "Female.Lower.CI"
## [21] "Female.Upper.CI"
## [22] "Non.hispanic.white.Prevalence"
## [23] "Non.hispanic.white.Lower.CI"
## [24] "Non.hispanic.white.Upper.CI"
## [25] "Non.hispanic.black.Prevalence"
## [26] "Non.hispanic.black.Lower.CI"
## [27] "Non.hispanic.black.Upper.CI"
## [28] "Hispanic.Prevalence"
## [29] "Hispanic.Lower.CI"
## [30] "Hispanic.Upper.CI"
## [31] "Asian.or.Pacific.Islander.Prevalence"
## [32] "Asian.or.Pacific.Islander.Lower.CI"
## [33] "Asian.or.Pacific.Islander.Upper.CI"
## [34] "State_Region"
## [35] "Source_UC"
## [36] "Source_Full3"
## [37] "Prevalence_Risk2"
## [38] "Prevalence_Risk4"
## [39] "Year_Factor"
# Filter [ Source: SPED ], keeping only two columns for SLR/PLR:
# Dependent variable: Prevalence
# Independent variable: Year
# ASD_State_4_PLR = subset(ASD_State, Source_UC == 'SPED' & State_Full2 == 'MN-Minnesota', select = c(Prevalence, Year))
# ASD_State_4_PLR = subset(ASD_State, Source_UC == 'SPED' & State_Full2 == 'MS-Mississippi', select = c(Prevalence, Year))
ASD_State_4_PLR = subset(ASD_State, Source_UC == 'SPED' & State_Full2 == 'FL-Florida', select = c(Prevalence, Year))
#
dim(ASD_State_4_PLR)
## [1] 17 2
head(ASD_State_4_PLR)
## Prevalence Year
## 849 1.5 2000
## 900 1.8 2001
## 951 2.1 2002
## 1002 2.4 2003
## 1052 2.7 2004
## 1100 3.0 2005
PLR Workshop Task: 2. b. Discover and visualize the data to gain insights. (Are there missing values in the dataframe? If so, how should they be handled?)
summary(ASD_State_4_PLR)
## Prevalence Year
## Min. : 1.500 Min. :2000
## 1st Qu.: 2.700 1st Qu.:2004
## Median : 4.900 Median :2008
## Mean : 5.694 Mean :2008
## 3rd Qu.: 8.300 3rd Qu.:2012
## Max. :12.100 Max. :2016
# Count missing values in each column:
lapply(ASD_State_4_PLR, function(col_x)sum(is.na(col_x)))
## $Prevalence
## [1] 0
##
## $Year
## [1] 0
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=3)
barplot(apply(ASD_State_4_PLR, 2, function(col_x)sum(is.na(col_x))))
No missing values
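Had missing values been found, two common remedies are listwise deletion and simple mean imputation. A sketch on hypothetical toy data (values invented for illustration):

```r
# Two simple ways to handle NAs, shown on toy data (hypothetical values):
d <- data.frame(Prevalence = c(1.5, NA, 2.1), Year = c(2000, 2001, 2002))
na.omit(d)  # option 1: drop rows containing NA
# option 2: impute with the column mean (here the mean of 1.5 and 2.1)
d$Prevalence[is.na(d$Prevalence)] <- mean(d$Prevalence, na.rm = TRUE)
d           # NA replaced by the column mean
```

Mean imputation keeps the row count but shrinks variance, so it is only a quick fix; deletion is safer when few rows are affected.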
PLR Workshop Task: 3. c. Visualize the data to gain insights
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
plot(ASD_State_4_PLR$Year, ASD_State_4_PLR$Prevalence)
PLR Workshop Task: 4. d. Compute correlation between variables and apply multiple regression.
Recode categorical variables to dummy (numeric) variables using one-hot encoding (not needed here: ASD_State_4_PLR contains only numeric variables):
# To use select_if() function
if(!require(dplyr)){install.packages("dplyr")}
library("dplyr")
summary(select_if(ASD_State_4_PLR, is.numeric))
## Prevalence Year
## Min. : 1.500 Min. :2000
## 1st Qu.: 2.700 1st Qu.:2004
## Median : 4.900 Median :2008
## Mean : 5.694 Mean :2008
## 3rd Qu.: 8.300 3rd Qu.:2012
## Max. :12.100 Max. :2016
correlation = cor(select_if(ASD_State_4_PLR, is.numeric))
correlation
## Prevalence Year
## Prevalence 1.000000 0.980119
## Year 0.980119 1.000000
# Each variable's correlation with the target (dependent) variable:
correlation[, 1]
## Prevalence Year
## 1.000000 0.980119
str(ASD_State_4_PLR)
## 'data.frame': 17 obs. of 2 variables:
## $ Prevalence: num 1.5 1.8 2.1 2.4 2.7 3 3.5 4.1 4.9 5.7 ...
## $ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
# SLR
fit_model_SLR = lm(Prevalence ~ Year , data = ASD_State_4_PLR)
#
summary(fit_model_SLR)
##
## Call:
## lm(formula = Prevalence ~ Year, data = ASD_State_4_PLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9150 -0.6566 -0.1108 0.4809 1.2392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1358.0726 71.2824 -19.05 6.37e-12 ***
## Year 0.6792 0.0355 19.13 6.00e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.717 on 15 degrees of freedom
## Multiple R-squared: 0.9606, Adjusted R-squared: 0.958
## F-statistic: 366 on 1 and 15 DF, p-value: 5.995e-12
plot(fit_model_SLR)
<p>
Adjusted $R^2$ = 0.958
</p>
# PLR (quadratic)
fit_model_PLR = lm(Prevalence ~ Year + I(Year^2), data = ASD_State_4_PLR)
#
summary(fit_model_PLR)
##
## Call:
## lm(formula = Prevalence ~ Year + I(Year^2), data = ASD_State_4_PLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26223 -0.05325 0.03671 0.11045 0.17918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.231e+05 6.980e+03 17.64 5.87e-11 ***
## Year -1.233e+02 6.953e+00 -17.73 5.45e-11 ***
## I(Year^2) 3.087e-02 1.731e-03 17.83 5.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1524 on 14 degrees of freedom
## Multiple R-squared: 0.9983, Adjusted R-squared: 0.9981
## F-statistic: 4209 on 2 and 14 DF, p-value: < 2.2e-16
About the I() function: https://stackoverflow.com/questions/8055508/in-r-formulas-why-do-i-have-to-use-the-i-function-on-power-terms-like-y-i
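The need for I() can be seen directly: inside a formula, ^ is a formula operator (factor crossing), so Year^2 silently collapses to Year, while I(Year^2) yields the true quadratic term. A sketch on hypothetical toy data:

```r
# In formula syntax, x^2 means crossing and collapses to x;
# I(x^2) forces arithmetic squaring. Toy data (hypothetical):
d <- data.frame(x = 1:5, y = (1:5)^2)
coef(lm(y ~ x + x^2,    data = d))  # only (Intercept) and x: ^ was ignored
coef(lm(y ~ x + I(x^2), data = d))  # (Intercept), x, I(x^2): true quadratic
```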
plot(fit_model_PLR)
<p>
Adjusted $R^2$ = 0.9981
</p>
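Since the SLR is nested inside the quadratic PLR, anova() gives a formal F-test of whether the quadratic term is worth adding; the same call works on fit_model_SLR and fit_model_PLR above. A sketch on hypothetical toy data:

```r
# anova() on nested lm fits tests the extra term(s) with an F-test.
# Toy data with genuine curvature (hypothetical):
set.seed(4)
d      <- data.frame(x = 1:30)
d$y    <- 0.5 * d$x^2 + rnorm(30)
m_lin  <- lm(y ~ x, data = d)
m_quad <- lm(y ~ x + I(x^2), data = d)
anova(m_lin, m_quad)  # a small p-value favours keeping I(x^2)
```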
Visualise the difference between SLR & PLR:
plot(ASD_State_4_PLR$Year, ASD_State_4_PLR$Prevalence)
# add SLR line
lines(ASD_State_4_PLR$Year, predict(fit_model_SLR, ASD_State_4_PLR), col="blue", lwd=2) # or # abline(fit_model_SLR, col="blue", lwd=2)
# add PLR line
lines(ASD_State_4_PLR$Year, predict(fit_model_PLR, ASD_State_4_PLR), col="orange", lwd=2)
PLR Prediction
newdata = ASD_State_4_PLR[1,] # Copy datastructure
newdata$Prevalence = NA
newdata$Year = 2025
newdata
## Prevalence Year
## 849 NA 2025
predict(fit_model_PLR, newdata, interval = "predict")
## fit lwr upr
## 849 25.42036 24.34469 26.49602
#
cat("Predicted ASD Prevalence of Year [", newdata$Year, "] is", round(predict(fit_model_PLR, newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence of Year [ 2025 ] is 25.4 per 1,000 Children
Multiple PLR Workshop Task: 5. e. Multiple polynomial regression (MPR). (Enhance the MLR by adding higher-order transformed variables.)
Reuse the MLR data ASD_State_4_MLR; copy it to a new dataframe, ASD_State_4_MPR:
ASD_State_4_MPR = ASD_State_4_MLR
dim(ASD_State_4_MPR)
## [1] 1692 5
summary(ASD_State_4_MPR)
## Denominator Prevalence Year Source
## Min. : 965 Min. : 0.400 Min. :2000 addm: 86
## 1st Qu.: 107151 1st Qu.: 3.100 1st Qu.:2003 medi:655
## Median : 353328 Median : 5.600 Median :2007 nsch: 98
## Mean : 604689 Mean : 7.191 Mean :2007 sped:853
## 3rd Qu.: 767928 3rd Qu.: 9.200 3rd Qu.:2011
## Max. :5824922 Max. :42.700 Max. :2016
##
## State_Full2
## AZ-Arizona : 40
## MD-Maryland : 40
## GA-Georgia : 39
## MO-Missouri : 39
## NC-North Carolina: 39
## WI-Wisconsin : 39
## (Other) :1456
# Count missing values in each column:
lapply(ASD_State_4_MLR, function(col_x)sum(is.na(col_x)))
## $Denominator
## [1] 0
##
## $Prevalence
## [1] 0
##
## $Year
## [1] 0
##
## $Source
## [1] 0
##
## $State_Full2
## [1] 0
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=3)
barplot(apply(ASD_State_4_MLR, 2, function(col_x)sum(is.na(col_x))))
Build the Multiple PLR model: + I(Year^2) + I(log(Denominator))
fit_model_MPR = lm(Prevalence ~ . + I(Year^2) + I(log(Denominator)), data = ASD_State_4_MPR) # "~." means all other variables, including factors
#
summary(fit_model_MPR)
##
## Call:
## lm(formula = Prevalence ~ . + I(Year^2) + I(log(Denominator)),
## data = ASD_State_4_MPR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.697 -1.238 0.007 1.157 19.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.908e+04 1.309e+04 6.044 1.86e-09 ***
## Denominator 2.074e-06 2.340e-07 8.863 < 2e-16 ***
## Year -7.943e+01 1.304e+01 -6.093 1.38e-09 ***
## Sourcemedi 8.189e-01 6.860e-01 1.194 0.232812
## Sourcensch 8.667e-02 7.138e-01 0.121 0.903371
## Sourcesped 2.740e-01 8.268e-01 0.331 0.740399
## State_Full2AL-Alabama 4.765e+00 7.495e-01 6.357 2.66e-10 ***
## State_Full2AR-Arkansas 1.978e+00 7.178e-01 2.755 0.005930 **
## State_Full2AZ-Arizona 4.043e+00 7.703e-01 5.248 1.74e-07 ***
## State_Full2CA-California -5.541e-01 1.239e+00 -0.447 0.654693
## State_Full2CO-Colorado 4.215e-01 7.283e-01 0.579 0.562857
## State_Full2CT-Connecticut 2.975e+00 7.250e-01 4.103 4.27e-05 ***
## State_Full2DC-District of Columbia -2.342e+00 7.256e-01 -3.227 0.001275 **
## State_Full2DE-Delaware 4.775e-01 6.739e-01 0.709 0.478711
## State_Full2FL-Florida 1.942e+00 8.972e-01 2.165 0.030531 *
## State_Full2GA-Georgia 3.500e+00 8.244e-01 4.245 2.31e-05 ***
## State_Full2HI-Hawaii -1.547e+00 6.823e-01 -2.267 0.023498 *
## State_Full2IA-Iowa 1.358e-01 7.160e-01 0.190 0.849558
## State_Full2ID-Idaho 3.888e+00 6.892e-01 5.642 1.98e-08 ***
## State_Full2IL-Illinois 3.055e+00 8.797e-01 3.473 0.000528 ***
## State_Full2IN-Indiana 5.214e+00 7.852e-01 6.641 4.24e-11 ***
## State_Full2KS-Kansas 1.358e+00 7.212e-01 1.883 0.059905 .
## State_Full2KY-Kentucky 2.358e+00 7.544e-01 3.126 0.001805 **
## State_Full2LA-Louisiana 9.990e-01 7.744e-01 1.290 0.197219
## State_Full2MA-Massachusetts 4.323e+00 7.724e-01 5.597 2.56e-08 ***
## State_Full2MD-Maryland 3.662e+00 7.407e-01 4.944 8.45e-07 ***
## State_Full2ME-Maine 7.323e+00 7.112e-01 10.297 < 2e-16 ***
## State_Full2MI-Michigan 4.812e+00 8.455e-01 5.691 1.49e-08 ***
## State_Full2MN-Minnesota 9.790e+00 7.477e-01 13.093 < 2e-16 ***
## State_Full2MO-Missouri 3.854e+00 7.598e-01 5.072 4.38e-07 ***
## State_Full2MS-Mississippi 5.804e-01 7.455e-01 0.779 0.436378
## State_Full2MT-Montana 1.230e+00 6.913e-01 1.779 0.075351 .
## State_Full2NC-North Carolina 3.919e+00 7.977e-01 4.913 9.84e-07 ***
## State_Full2ND-North Dakota -4.673e-01 6.898e-01 -0.677 0.498273
## State_Full2NE-Nebraska -3.600e-01 6.961e-01 -0.517 0.605101
## State_Full2NH-New Hampshire 2.854e+00 6.714e-01 4.250 2.25e-05 ***
## State_Full2NJ-New Jersey 5.703e+00 7.771e-01 7.338 3.39e-13 ***
## State_Full2NM-New Mexico -2.369e-01 7.227e-01 -0.328 0.743118
## State_Full2NV-Nevada 1.166e+00 6.982e-01 1.670 0.095033 .
## State_Full2NY-New York 2.614e+00 9.153e-01 2.856 0.004349 **
## State_Full2OH-Ohio 3.553e+00 8.682e-01 4.092 4.49e-05 ***
## State_Full2OK-Oklahoma 1.227e+00 7.474e-01 1.642 0.100745
## State_Full2OR-Oregon 5.859e+00 7.202e-01 8.135 8.06e-16 ***
## State_Full2PA-Pennsylvania 4.490e+00 8.246e-01 5.445 5.96e-08 ***
## State_Full2RI-Rhode Island 4.587e+00 6.704e-01 6.841 1.10e-11 ***
## State_Full2SC-South Carolina 1.791e+00 7.436e-01 2.408 0.016138 *
## State_Full2SD-South Dakota -1.629e+00 6.795e-01 -2.398 0.016616 *
## State_Full2TN-Tennessee 1.344e+00 7.893e-01 1.703 0.088708 .
## State_Full2TX-Texas -1.007e-03 1.078e+00 -0.001 0.999254
## State_Full2UT-Utah 1.500e+00 6.915e-01 2.169 0.030212 *
## State_Full2VA-Virginia 3.846e+00 7.919e-01 4.856 1.31e-06 ***
## State_Full2VT-Vermont 3.076e+00 6.870e-01 4.478 8.04e-06 ***
## State_Full2WA-Washington 2.227e+00 7.979e-01 2.791 0.005318 **
## State_Full2WI-Wisconsin 3.271e+00 7.469e-01 4.379 1.27e-05 ***
## State_Full2WV-West Virginia 2.188e+00 6.877e-01 3.182 0.001492 **
## State_Full2WY-Wyoming -1.521e+00 6.981e-01 -2.178 0.029521 *
## I(Year^2) 1.995e-02 3.247e-03 6.145 1.01e-09 ***
## I(log(Denominator)) -2.265e+00 2.335e-01 -9.701 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 1634 degrees of freedom
## Multiple R-squared: 0.7956, Adjusted R-squared: 0.7884
## F-statistic: 111.6 on 57 and 1634 DF, p-value: < 2.2e-16
<p>
Adjusted $R^2$ = 0.7884
</p>
Multiple Polynomial Regression (MPR) Prediction
# Copy datastructure
newdata = subset(ASD_State_4_MPR, ASD_State_4_MPR$Year == 2016 &
ASD_State_4_MPR$State_Full2 == 'FL-Florida' &
ASD_State_4_MPR$Source == 'sped')
newdata
## Denominator Prevalence Year Source State_Full2
## 1652 2555399 12.1 2016 sped FL-Florida
newdata = ASD_State_4_MPR[1,] # Copy datastructure
newdata$Prevalence = NA
newdata$Denominator = 2600000
newdata$Year = 2025
newdata$Source = "sped"
newdata$State_Full2 = "FL-Florida"
newdata
## Denominator Prevalence Year Source State_Full2
## 1 2600000 NA 2025 sped FL-Florida
predict(fit_model_MPR, newdata, interval = "predict")
## fit lwr upr
## 1 22.70557 17.01847 28.39267
#
cat("Predicted ASD Prevalence of Year [", newdata$Year, "] is", round(predict(fit_model_MPR,newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence of Year [ 2025 ] is 22.7 per 1,000 Children
<h3>
Quiz:
</h3>
<p>
Compare predicted ASD prevalence and model $R^2$ between: Multiple Linear Regression and Multiple Polynomial Regression.
</p>
<p>
Which prediction result would you use? Provide your justification.
</p>
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
<a href="">
<img src="" width="750" align="center">
</a>
<h3>
Linear Model: Logistic Regression (LR) - Workshop Task
</h3>
Workshop Task:
LR Workshop Task: 1. a. Get the data.
Use Case Data: “../dataset/ADV_ASD_State_R.csv”
Read in CSV data, storing as R dataframe
# Read back in above saved file:
# ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
# ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
# ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
head(ASD_State)
## State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1 AZ 45322 6.5 5.8 7.3 2000 addm
## 2 GA 43593 6.5 5.8 7.3 2000 addm
## 3 MD 21532 5.5 4.6 6.6 2000 addm
## 4 NJ 29714 9.9 8.9 11.1 2000 addm
## 5 SC 24535 6.3 5.4 7.4 2000 addm
## 6 WV 23065 4.5 3.7 5.5 2000 addm
## Source_Full1 State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network Arizona
## 2 Autism & Developmental Disabilities Monitoring Network Georgia
## 3 Autism & Developmental Disabilities Monitoring Network Maryland
## 4 Autism & Developmental Disabilities Monitoring Network New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network West Virginia
## State_Full2 Numerator_ASD Numerator_NonASD Proportion
## 1 AZ-Arizona 295 45027 0.006508980
## 2 GA-Georgia 283 43310 0.006491868
## 3 MD-Maryland 118 21414 0.005480215
## 4 NJ-New Jersey 294 29420 0.009894326
## 5 SC-South Carolina 155 24380 0.006317506
## 6 WV-West Virginia 104 22961 0.004508996
## Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1 5.798905 7.303948 9.7
## 2 5.769431 7.302595 11.0
## 3 4.557351 6.583638 8.6
## 4 8.814705 11.102544 14.8
## 5 5.381662 7.411085 9.3
## 6 3.703408 5.483723 6.6
## Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1 8.5 11.1 3.2 2.5 4.0
## 2 9.7 12.4 2.0 1.5 2.7
## 3 7.1 10.6 2.2 1.5 2.7
## 4 13.0 16.8 4.3 3.3 5.5
## 5 7.8 11.2 3.3 2.4 4.5
## 6 5.2 8.2 2.4 1.6 3.5
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1 8.6 7.5
## 2 7.9 6.7
## 3 4.9 3.8
## 4 11.3 9.5
## 5 6.5 5.2
## 6 4.5 3.7
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1 9.8 7.3
## 2 9.3 5.3
## 3 6.4 6.1
## 4 13.3 10.6
## 5 8.2 5.8
## 6 5.5 NA
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1 4.4 12.2 NA
## 2 4.4 6.4 NA
## 3 4.7 8.0 NA
## 4 8.5 13.1 NA
## 5 4.5 7.3 NA
## 6 NA NA NA
## Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## State_Region Source_UC
## 1 D8 Mountain ADDM
## 2 D5 South Atlantic ADDM
## 3 D5 South Atlantic ADDM
## 4 D2 Middle Atlantic ADDM
## 5 D5 South Atlantic ADDM
## 6 D5 South Atlantic ADDM
## Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network Low
## Prevalence_Risk4 Year_Factor
## 1 Medium 2000
## 2 Medium 2000
## 3 Medium 2000
## 4 Medium 2000
## 5 Medium 2000
## 6 Low 2000
Column_Names = c("Prevalence_Risk2", "Denominator", "Year", "Source", "State_Full2")
ASD_State_4_LR_Risk2 <- ASD_State[ , (names(ASD_State) %in% Column_Names)]
dim(ASD_State_4_LR_Risk2)
## [1] 1692 5
head(ASD_State_4_LR_Risk2)
## Denominator Year Source State_Full2 Prevalence_Risk2
## 1 45322 2000 addm AZ-Arizona High
## 2 43593 2000 addm GA-Georgia High
## 3 21532 2000 addm MD-Maryland High
## 4 29714 2000 addm NJ-New Jersey High
## 5 24535 2000 addm SC-South Carolina High
## 6 23065 2000 addm WV-West Virginia Low
Column_Names = c("Prevalence_Risk4", "Denominator", "Year", "Source", "State_Full2")
ASD_State_4_LR_Risk4 <- ASD_State[ , (names(ASD_State) %in% Column_Names)]
dim(ASD_State_4_LR_Risk4)
## [1] 1692 5
head(ASD_State_4_LR_Risk4)
## Denominator Year Source State_Full2 Prevalence_Risk4
## 1 45322 2000 addm AZ-Arizona Medium
## 2 43593 2000 addm GA-Georgia Medium
## 3 21532 2000 addm MD-Maryland Medium
## 4 29714 2000 addm NJ-New Jersey Medium
## 5 24535 2000 addm SC-South Carolina Medium
## 6 23065 2000 addm WV-West Virginia Low
LR Workshop Task: 2. b. Logistic Regression (LR) Binary Classification. (Reuse the Multiple Polynomial Model on a categorical dependent variable.)
table(ASD_State_4_LR_Risk2$Prevalence_Risk2)
##
## Low High
## 740 952
barplot(table(ASD_State_4_LR_Risk2$Prevalence_Risk2))
counts = table(ASD_State_4_LR_Risk2$Prevalence_Risk2, ASD_State_4_LR_Risk2$Source)
counts
##
## addm medi nsch sped
## Low 5 344 0 391
## High 81 311 98 462
barplot(counts,
main="Prevalence by Data Sources and Risk Levels",
xlab="Data Sources",
ylab="Occurrences",
col=c("white", "lightgrey"),
legend = rownames(counts),
args.legend = list(x = "topleft", bty = "n", cex = 0.85, y.intersp = 4))
str(ASD_State_4_LR_Risk2)
## 'data.frame': 1692 obs. of 5 variables:
## $ Denominator : int 45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
## $ Year : int 2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
## $ Source : Factor w/ 4 levels "addm","medi",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ State_Full2 : Factor w/ 51 levels "AK-Alaska","AL-Alabama",..: 4 11 21 32 41 50 2 4 3 6 ...
## $ Prevalence_Risk2: Ord.factor w/ 2 levels "Low"<"High": 2 2 2 2 2 1 1 2 2 2 ...
Build model
# Binary Classification:
fit_model_LR_Risk2 = glm(Prevalence_Risk2 ~ Denominator + Year + Source + State_Full2 + I(Year^2) + I(log(Denominator)),
family=binomial(link='logit'), data = ASD_State_4_LR_Risk2)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#
summary(fit_model_LR_Risk2)
##
## Call:
## glm(formula = Prevalence_Risk2 ~ Denominator + Year + Source +
## State_Full2 + I(Year^2) + I(log(Denominator)), family = binomial(link = "logit"),
## data = ASD_State_4_LR_Risk2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1715 -0.2765 0.0036 0.2843 2.9153
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.555e+04 2.844e+04 1.250 0.211249
## Denominator 3.855e-06 5.131e-07 7.514 5.75e-14 ***
## Year -3.614e+01 2.834e+01 -1.275 0.202284
## Sourcemedi 1.381e+00 1.290e+00 1.071 0.284340
## Sourcensch 6.822e+00 4.151e+02 0.016 0.986887
## Sourcesped 1.242e+00 1.610e+00 0.771 0.440412
## State_Full2AL-Alabama 2.623e+00 1.182e+00 2.219 0.026454 *
## State_Full2AR-Arkansas 3.721e+00 1.108e+00 3.359 0.000781 ***
## State_Full2AZ-Arizona 4.067e+00 1.303e+00 3.122 0.001795 **
## State_Full2CA-California -7.922e+00 2.916e+00 -2.717 0.006595 **
## State_Full2CO-Colorado 6.498e-01 1.124e+00 0.578 0.563137
## State_Full2CT-Connecticut 2.939e+00 1.068e+00 2.752 0.005925 **
## State_Full2DC-District of Columbia -2.947e+00 9.614e-01 -3.066 0.002171 **
## State_Full2DE-Delaware -1.158e+00 8.549e-01 -1.355 0.175515
## State_Full2FL-Florida -5.360e-02 1.672e+00 -0.032 0.974422
## State_Full2GA-Georgia 3.631e+00 1.426e+00 2.547 0.010879 *
## State_Full2HI-Hawaii -9.970e-01 8.746e-01 -1.140 0.254317
## State_Full2IA-Iowa -7.461e-01 1.050e+00 -0.710 0.477531
## State_Full2ID-Idaho 3.614e+00 9.177e-01 3.938 8.21e-05 ***
## State_Full2IL-Illinois 2.320e+00 1.604e+00 1.446 0.148090
## State_Full2IN-Indiana 5.539e+00 1.302e+00 4.253 2.11e-05 ***
## State_Full2KS-Kansas 2.352e+00 9.880e-01 2.381 0.017270 *
## State_Full2KY-Kentucky 2.421e+00 1.167e+00 2.075 0.037999 *
## State_Full2LA-Louisiana -1.964e-01 1.277e+00 -0.154 0.877789
## State_Full2MA-Massachusetts 3.239e+00 1.237e+00 2.619 0.008814 **
## State_Full2MD-Maryland 5.419e+00 1.230e+00 4.405 1.06e-05 ***
## State_Full2ME-Maine 4.860e+00 1.002e+00 4.852 1.22e-06 ***
## State_Full2MI-Michigan 6.864e+00 1.528e+00 4.493 7.03e-06 ***
## State_Full2MN-Minnesota 8.257e+00 1.399e+00 5.904 3.55e-09 ***
## State_Full2MO-Missouri 4.564e+00 1.282e+00 3.561 0.000370 ***
## State_Full2MS-Mississippi 1.114e-01 1.109e+00 0.100 0.919981
## State_Full2MT-Montana -9.584e-01 8.950e-01 -1.071 0.284242
## State_Full2NC-North Carolina 4.455e+00 1.411e+00 3.157 0.001595 **
## State_Full2ND-North Dakota -5.683e-01 9.133e-01 -0.622 0.533771
## State_Full2NE-Nebraska -8.605e-01 9.379e-01 -0.917 0.358910
## State_Full2NH-New Hampshire 2.232e+00 8.838e-01 2.526 0.011541 *
## State_Full2NJ-New Jersey 4.045e+00 1.301e+00 3.109 0.001880 **
## State_Full2NM-New Mexico -7.265e-01 1.058e+00 -0.687 0.492243
## State_Full2NV-Nevada 6.972e-02 9.263e-01 0.075 0.940005
## State_Full2NY-New York 1.409e+00 1.736e+00 0.812 0.417034
## State_Full2OH-Ohio 3.032e+00 1.527e+00 1.985 0.047092 *
## State_Full2OK-Oklahoma 1.284e+00 1.171e+00 1.097 0.272786
## State_Full2OR-Oregon 7.916e+00 1.312e+00 6.033 1.61e-09 ***
## State_Full2PA-Pennsylvania 3.412e+00 1.485e+00 2.297 0.021595 *
## State_Full2RI-Rhode Island 3.524e+00 9.128e-01 3.861 0.000113 ***
## State_Full2SC-South Carolina 2.740e+00 1.172e+00 2.339 0.019357 *
## State_Full2SD-South Dakota -3.510e+00 9.095e-01 -3.859 0.000114 ***
## State_Full2TN-Tennessee 1.575e+00 1.310e+00 1.202 0.229375
## State_Full2TX-Texas -4.908e+00 2.341e+00 -2.097 0.035995 *
## State_Full2UT-Utah 1.130e+00 9.597e-01 1.177 0.239101
## State_Full2VA-Virginia 3.173e+00 1.278e+00 2.482 0.013077 *
## State_Full2VT-Vermont 1.972e+00 9.016e-01 2.188 0.028703 *
## State_Full2WA-Washington 1.927e+00 1.312e+00 1.469 0.141841
## State_Full2WI-Wisconsin 6.754e+00 1.253e+00 5.391 7.02e-08 ***
## State_Full2WV-West Virginia 2.887e+00 9.484e-01 3.044 0.002334 **
## State_Full2WY-Wyoming -1.637e+00 9.034e-01 -1.813 0.069900 .
## I(Year^2) 9.189e-03 7.062e-03 1.301 0.193207
## I(log(Denominator)) -2.975e+00 5.033e-01 -5.912 3.39e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2319.0 on 1691 degrees of freedom
## Residual deviance: 840.9 on 1634 degrees of freedom
## AIC: 956.9
##
## Number of Fisher Scoring iterations: 17
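The coefficients above are on the log-odds scale, so their exponentials can be read as odds ratios. A minimal sketch, assuming `fit_model_LR_Risk2` from the chunk above is still in memory:

```r
# Odds ratios: exp() of the log-odds coefficients.
OR <- exp(coef(fit_model_LR_Risk2))
head(round(OR, 4))
# Note the odds ratio for Denominator is very close to 1 per additional child
# counted, which is why I(log(Denominator)) carries most of the scale effect.
```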
Evaluate model
# Likelihood ratio test: significance of the difference between the full model and the null model.
pchisq(fit_model_LR_Risk2$null.deviance - fit_model_LR_Risk2$deviance,
fit_model_LR_Risk2$df.null - fit_model_LR_Risk2$df.residual, lower.tail = FALSE)
## [1] 1.533547e-271
Check whether the above p-value is very small (the smaller, the more significant), e.g. &lt; 0.05.
< How to perform a Logistic Regression in R > Michy Alice
https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
# null deviance and the residual deviance
anova(fit_model_LR_Risk2, test="Chisq")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Prevalence_Risk2
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1691 2318.98
## Denominator 1 3.54 1690 2315.44 0.06000 .
## Year 1 799.97 1689 1515.47 < 2.2e-16 ***
## Source 3 130.38 1686 1385.09 < 2.2e-16 ***
## State_Full2 50 503.15 1636 881.94 < 2.2e-16 ***
## I(Year^2) 1 3.12 1635 878.81 0.07711 .
## I(log(Denominator)) 1 37.91 1634 840.90 7.391e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# R^2 equivalent
if(!require(pscl)){install.packages("pscl")}
## Loading required package: pscl
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library("pscl")
While no exact equivalent to the $R^2$ of linear regression exists, the McFadden $R^2$ index can be used to assess the model fit.
# R^2 equivalent
pR2(fit_model_LR_Risk2)[4]
## McFadden
## 0.6373833
<p>
McFadden $R^2$ = 0.6374
</p>
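As a sanity check, McFadden $R^2$ can also be computed directly from the deviances printed in the model summary, since for a binomial GLM with binary responses the deviance equals $-2 \times$ log-likelihood. A sketch, assuming `fit_model_LR_Risk2` and `ASD_State_4_LR_Risk2` from above:

```r
# McFadden R^2 by hand: 1 - (residual deviance / null deviance).
McFadden <- 1 - fit_model_LR_Risk2$deviance / fit_model_LR_Risk2$null.deviance
round(McFadden, 4)  # should agree with pR2() above

# In-sample classification check: threshold the fitted probabilities at 0.5
# (glm models the probability of the second level, "High").
pred_class <- ifelse(fitted(fit_model_LR_Risk2) > 0.5, "High", "Low")
table(Predicted = pred_class, Actual = ASD_State_4_LR_Risk2$Prevalence_Risk2)
```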
LR Workshop Task: 3. c. Logistic Regression (LR) Multi-Class Classification. (Reuse the Multiple Polynomial Model on a categorical dependent variable.)
table(ASD_State_4_LR_Risk4$Prevalence_Risk4)
##
## Low Medium High Very High
## 740 584 294 74
barplot(table(ASD_State_4_LR_Risk4$Prevalence_Risk4))
counts = table(ASD_State_4_LR_Risk4$Prevalence_Risk4, ASD_State_4_LR_Risk4$Source)
counts
##
## addm medi nsch sped
## Low 5 344 0 391
## Medium 36 225 5 318
## High 38 74 38 144
## Very High 7 12 55 0
barplot(counts,
main="Prevalence Occurrence by Source and Risk",
xlab="Data Sources",
ylab="Occurrences",
col=c("lightyellow", "orange", "red","darkred"),
legend = rownames(counts),
args.legend = list(x = "topleft", bty = "n", cex = 0.85, y.intersp = 4))
Build model
# multinom function from the nnet package
if(!require(nnet)){install.packages("nnet")}
## Loading required package: nnet
library("nnet")
# Multi-Class Classification:
fit_model_LR_Risk4 = multinom(Prevalence_Risk4 ~ Denominator + Year + Source + State_Full2 + I(Year^2) + I(log(Denominator)),
data = ASD_State_4_LR_Risk4, maxit=1000) # maxit https://cran.r-project.org/web/packages/nnet/nnet.pdf
## # weights: 236 (174 variable)
## initial value 2345.610059
## iter 10 value 1642.176333
## iter 20 value 1287.779160
## iter 30 value 1107.430631
## iter 40 value 921.386084
## iter 50 value 890.347795
## iter 60 value 812.517810
## iter 70 value 808.708251
## iter 80 value 790.145541
## iter 90 value 789.356983
## iter 90 value 789.356983
## iter 100 value 788.697764
## iter 110 value 788.263896
## iter 120 value 787.147810
## iter 120 value 787.147808
## iter 130 value 785.704645
## iter 130 value 785.704644
## iter 140 value 785.135020
## iter 140 value 785.135020
## iter 150 value 784.905675
## iter 150 value 784.905675
## iter 160 value 784.812617
## iter 160 value 784.812617
## iter 170 value 784.774742
## iter 170 value 784.774742
## iter 180 value 784.759309
## iter 180 value 784.759309
## iter 190 value 784.753016
## iter 190 value 784.753016
## iter 200 value 784.750451
## iter 200 value 784.750451
## iter 210 value 784.749404
## iter 210 value 784.749404
## iter 220 value 784.748978
## iter 220 value 784.748978
## iter 230 value 784.748804
## iter 230 value 784.748804
## final value 784.748760
## converged
summary(fit_model_LR_Risk4)
## Call:
## multinom(formula = Prevalence_Risk4 ~ Denominator + Year + Source +
## State_Full2 + I(Year^2) + I(log(Denominator)), data = ASD_State_4_LR_Risk4,
## maxit = 1000)
##
## Coefficients:
## (Intercept) Denominator Year Sourcemedi Sourcensch
## Medium -0.0004007338 2.315392e-06 -0.6177611 -0.3983914 1.288252
## High -0.0012785664 4.217311e-06 -1.1823721 -2.5260908 0.260585
## Very High -0.0011016480 4.541402e-06 -1.6324251 2.4329304 -3.579447
## Sourcesped State_Full2AL-Alabama State_Full2AR-Arkansas
## Medium -0.5741604 0.259656 1.55153213
## High -4.1602188 1.793057 -0.04348038
## Very High -6.7861902 6.666128 -2.13524747
## State_Full2AZ-Arizona State_Full2CA-California State_Full2CO-Colorado
## Medium 1.6167595 -4.909023 -0.9979121
## High 2.5982487 -9.553742 -2.3874541
## Very High 0.5183466 -8.552350 -6.6330509
## State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium 0.6888067 -3.360576
## High 2.8704430 -3.307447
## Very High 0.9797351 -5.173241
## State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium -2.053038 -1.298922 1.3332883
## High -2.057146 -2.591858 1.0150253
## Very High -2.146310 -4.002177 0.7796848
## State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium -1.906098 -2.279007 1.770407
## High -3.940876 -2.758616 3.491865
## Very High -7.794271 -4.414376 6.716912
## State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium 0.4698186 2.900020 0.6011796
## High -0.7865485 5.099685 -1.7097399
## Very High 1.9329550 5.232929 -4.0113460
## State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium 0.2443813 -1.898521
## High 0.1226346 -3.786383
## Very High 1.4148624 -5.282611
## State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium 0.6339059 2.920329
## High 3.1676739 4.244674
## Very High 3.9373067 3.480534
## State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium 2.486395 4.418793 5.227256
## High 8.020376 5.139910 11.144972
## Very High 11.401158 3.322766 11.611459
## State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium 1.972230 -1.556161
## High 3.289085 -5.205541
## Very High 3.521192 -1.477672
## State_Full2MT-Montana State_Full2NC-North Carolina
## Medium -1.877082 2.0851871
## High -0.197909 2.4901862
## Very High -1.285098 0.5156278
## State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium -1.0216012 -2.189739
## High 0.3656576 -2.024072
## Very High -4.8018843 -4.879143
## State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium 0.7622948 1.584463
## High 3.7620099 3.174298
## Very High 1.4202079 7.779043
## State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium -2.207321 -1.4251488 -0.003734705
## High -5.688137 -0.9769902 -1.027852534
## Very High -2.098040 -1.6479051 -2.507489677
## State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium 0.8580716 -0.6403649 5.188302
## High 1.4962143 -2.0139406 8.107611
## Very High 2.9609860 -4.0221542 8.045266
## State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium 1.091974 1.824172
## High 2.409973 5.986608
## Very High 3.450578 4.974181
## State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium 0.6450362 -3.953263
## High -0.5909618 -6.017699
## Very High -2.4215065 -7.979805
## State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium -0.3863012 -3.989071 -0.4175548
## High -2.5265653 -10.100360 -1.6451886
## Very High -4.1210715 -1.627015 -2.3027877
## State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium 0.7890874 0.881496 -0.2522830
## High 2.7623520 3.880883 -0.4083035
## Very High 0.9597204 3.867580 -0.1751737
## State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium 4.249644 0.9415586
## High 4.899023 2.0562665
## Very High 3.003220 2.3448054
## State_Full2WY-Wyoming I(Year^2) I(log(Denominator))
## Medium -1.916496 0.0003134994 -1.885972
## High -3.657686 0.0005966326 -2.642810
## Very High -6.460238 0.0008259125 -5.130174
##
## Std. Errors:
## (Intercept) Denominator Year Sourcemedi Sourcensch
## Medium 6.504869e-15 1.016564e-07 1.302425e-11 1.114663e-12 2.335311e-13
## High 1.237153e-14 1.417345e-07 2.095135e-11 2.338936e-12 2.691731e-12
## Very High 1.556105e-14 1.663810e-06 3.208907e-11 3.298394e-12 2.925267e-12
## Sourcesped State_Full2AL-Alabama State_Full2AR-Arkansas
## Medium 6.032300e-13 4.611155e-13 2.418736e-14
## High 4.842465e-13 5.054817e-13 4.399004e-14
## Very High 6.850639e-14 1.070782e-12 8.549246e-14
## State_Full2AZ-Arizona State_Full2CA-California State_Full2CO-Colorado
## Medium 4.381850e-14 6.176288e-14 2.215907e-14
## High 3.545111e-14 1.570842e-13 4.494078e-15
## Very High 8.425035e-14 6.526380e-14 3.901093e-14
## State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium 3.297721e-14 1.882122e-14
## High 3.305750e-14 4.937714e-14
## Very High 8.701018e-14 1.871889e-15
## State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium 2.759187e-15 3.693894e-14 3.524273e-14
## High 1.693835e-14 9.245737e-14 1.473422e-14
## Very High 4.743799e-14 7.911506e-14 8.362122e-15
## State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium 1.813085e-14 1.733870e-15 1.204846e-13
## High 3.650909e-14 6.787245e-14 1.457238e-13
## Very High 4.725432e-14 8.730409e-14 2.455899e-13
## State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium 1.520165e-14 2.951863e-14 5.667162e-15
## High 1.545507e-14 5.134649e-14 1.026072e-14
## Very High 8.946492e-15 8.284287e-14 2.487088e-14
## State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium 8.399369e-15 1.042326e-14
## High 4.569650e-15 5.474020e-14
## Very High 5.752079e-15 6.911962e-14
## State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium 1.645795e-14 5.008280e-14
## High 3.631590e-14 1.108605e-13
## Very High 1.206869e-14 1.689210e-13
## State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium 1.553973e-14 2.953357e-14 4.294752e-14
## High 1.011270e-13 7.570489e-14 8.167658e-15
## Very High 6.169331e-14 8.877702e-14 4.157086e-14
## State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium 4.204133e-14 2.101698e-14
## High 3.369476e-14 1.239228e-15
## Very High 8.024045e-14 1.186427e-14
## State_Full2MT-Montana State_Full2NC-North Carolina
## Medium 4.020754e-14 4.737792e-14
## High 1.932927e-14 4.809309e-14
## Very High 9.331166e-14 8.437890e-14
## State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium 4.733390e-14 6.068673e-15
## High 4.818736e-14 1.702983e-14
## Very High 3.160265e-14 1.743880e-14
## State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium 4.550843e-14 4.730878e-13
## High 3.099444e-15 5.256319e-13
## Very High 7.715746e-14 1.038871e-12
## State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium 2.228400e-14 7.112411e-15 4.630727e-14
## High 1.536900e-15 3.405749e-14 9.901473e-14
## Very High 1.163183e-14 9.852721e-16 8.721041e-14
## State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium 1.598397e-14 1.239635e-14 4.717532e-14
## High 1.401742e-14 5.308381e-14 2.557711e-15
## Very High 2.210444e-15 7.628757e-14 5.286462e-14
## State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium 4.030365e-14 4.488620e-14
## High 1.752958e-14 2.402991e-14
## Very High 3.561026e-14 5.043292e-14
## State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium 1.557550e-14 9.401830e-15
## High 4.641252e-14 1.658577e-14
## Very High 7.568193e-14 3.571812e-14
## State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium 2.154977e-14 1.406481e-14 2.396506e-14
## High 5.974267e-14 3.518306e-14 9.415083e-14
## Very High 7.992758e-14 1.125899e-13 1.401813e-13
## State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium 2.544850e-14 4.620749e-14 2.766294e-14
## High 5.860562e-14 2.523099e-14 6.774583e-14
## Very High 8.674593e-14 9.937702e-14 9.407003e-14
## State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium 5.186528e-14 5.540629e-15
## High 4.334490e-14 3.743788e-14
## Very High 1.007639e-13 1.454464e-14
## State_Full2WY-Wyoming I(Year^2) I(log(Denominator))
## Medium 6.722288e-15 2.610422e-08 1.928104e-12
## High 1.023602e-14 3.820346e-08 1.410502e-11
## Very High 2.508895e-14 6.797764e-08 1.786641e-11
##
## Residual Deviance: 1569.498
## AIC: 1917.498
< MULTINOMIAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES >
https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
## extract the coefficients from the model and exponentiate
exp(coef(fit_model_LR_Risk4))
## (Intercept) Denominator Year Sourcemedi Sourcensch Sourcesped
## Medium 0.9995993 1.000002 0.5391502 0.67139922 3.62644214 0.563177522
## High 0.9987223 1.000004 0.3065507 0.07997104 1.29768902 0.015604143
## Very High 0.9988990 1.000005 0.1954550 11.39221703 0.02789111 0.001129263
## State_Full2AL-Alabama State_Full2AR-Arkansas State_Full2AZ-Arizona
## Medium 1.296484 4.7186943 5.036742
## High 6.007789 0.9574513 13.440179
## Very High 785.348612 0.1182153 1.679249
## State_Full2CA-California State_Full2CO-Colorado
## Medium 7.379693e-03 0.368648343
## High 7.093535e-05 0.091863261
## Very High 1.930908e-04 0.001316142
## State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium 1.991338 0.034715260
## High 17.644834 0.036609512
## Very High 2.663751 0.005666173
## State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium 0.1283444 0.27282579 3.793497
## High 0.1278183 0.07488077 2.759433
## Very High 0.1169147 0.01827580 2.180785
## State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium 0.148659345 0.10238578 5.873246
## High 0.019431181 0.06337940 32.847135
## Very High 0.000412089 0.01210211 826.261954
## State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium 1.5997040 18.1745 1.8242694
## High 0.4554139 163.9703 0.1809128
## Very High 6.9098986 187.3408 0.0181090
## State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium 1.276831 0.149790014
## High 1.130471 0.022677484
## Very High 4.115920 0.005079153
## State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium 1.884959 18.54739
## High 23.752169 69.73300
## Very High 51.280300 32.47705
## State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium 12.01788 82.99607 186.2809
## High 3042.32231 170.70035 69214.9296
## Very High 89425.22954 27.73696 110355.1501
## State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium 7.186686 0.210944310
## High 26.818327 0.005486084
## Very High 33.824707 0.228168268
## State_Full2MT-Montana State_Full2NC-North Carolina
## Medium 0.1530360 8.046097
## High 0.8204445 12.063522
## Very High 0.2766236 1.674689
## State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium 0.360018027 0.111945929
## High 1.441461648 0.132116387
## Very High 0.008214255 0.007603527
## State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium 2.143189 4.876673
## High 43.034836 23.910026
## Very High 4.137981 2389.985984
## State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium 0.109994977 0.2404727 0.99627226
## High 0.003385895 0.3764424 0.35777444
## Very High 0.122696725 0.1924526 0.08147251
## State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium 2.358608 0.52710006 179.164
## High 4.464755 0.13346171 3319.637
## Very High 19.317009 0.01791433 3118.994
## State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium 2.98015 6.197663
## High 11.13366 398.062173
## Very High 31.51861 144.630278
## State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium 1.90605600 0.0191919800
## High 0.55379441 0.0024352668
## Very High 0.08878776 0.0003423061
## State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium 0.67956583 1.851691e-02 0.65865543
## High 0.07993310 4.106477e-05 0.19297616
## Very High 0.01622712 1.965154e-01 0.09997974
## State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium 2.201387 2.414509 0.7770248
## High 15.837048 48.466974 0.6647771
## Very High 2.610966 47.826518 0.8393112
## State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium 70.08045 2.563975
## High 134.15860 7.816731
## Very High 20.15031 10.431243
## State_Full2WY-Wyoming I(Year^2) I(log(Denominator))
## Medium 0.147121636 1.000314 0.15168154
## High 0.025792134 1.000597 0.07116099
## Very High 0.001564424 1.000826 0.00591553
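The fitted multinomial model can also be used for prediction via `nnet`'s `predict` method, following the same new-data pattern used for the MPR prediction earlier. A sketch with illustrative (not observed) values:

```r
# Predict the risk class for one hypothetical observation.
newdata <- ASD_State_4_LR_Risk4[1, ]   # copy the data structure
newdata$Denominator <- 2600000
newdata$Year <- 2025
newdata$Source <- "sped"
newdata$State_Full2 <- "FL-Florida"
predict(fit_model_LR_Risk4, newdata, type = "class")           # most likely class
round(predict(fit_model_LR_Risk4, newdata, type = "probs"), 3) # class probabilities
```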
# Uncomment below to display the exponentiated coefficients as strings:
# paste(exp(coef(fit_model_LR_Risk4)))
# Test the significance/importance of each coefficient (check if p values < 0.05)
# Note: the near-zero standard errors above produce extreme z scores, so treat the resulting p values with caution.
# z scores for coefficients
z <- summary(fit_model_LR_Risk4)$coefficients/summary(fit_model_LR_Risk4)$standard.errors
cat('\n< Table of coefficient z scores >')
## 
## < Table of coefficient z scores >
z
## (Intercept) Denominator Year Sourcemedi Sourcensch
## Medium -61605204301 22.77665 -47431585790 -3.574096e+11 5.516405e+12
## High -103347504827 29.75500 -56434165146 -1.080017e+12 9.680944e+10
## Very High -70795238650 2.72952 -50871679968 7.376106e+11 -1.223631e+12
## Sourcesped State_Full2AL-Alabama State_Full2AR-Arkansas
## Medium -9.518100e+11 5.631040e+11 6.414641e+13
## High -8.591119e+12 3.547224e+12 -9.884141e+11
## Very High -9.905923e+13 6.225476e+12 -2.497586e+13
## State_Full2AZ-Arizona State_Full2CA-California State_Full2CO-Colorado
## Medium 3.689674e+13 -7.948178e+13 -4.503402e+13
## High 7.329104e+13 -6.081924e+13 -5.312445e+14
## Very High 6.152456e+12 -1.310428e+14 -1.700306e+14
## State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium 2.088735e+13 -1.785525e+14
## High 8.683181e+13 -6.698337e+13
## Very High 1.126001e+13 -2.763647e+15
## State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium -7.440734e+14 -3.516403e+13 3.783159e+13
## High -1.214490e+14 -2.803301e+13 6.888899e+13
## Very High -4.524454e+13 -5.058680e+13 9.324007e+13
## State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium -1.051301e+14 -1.314405e+15 1.469405e+13
## High -1.079423e+14 -4.064413e+13 2.396221e+13
## Very High -1.649430e+14 -5.056322e+13 2.735012e+13
## State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium 3.090576e+13 9.824372e+13 1.060812e+14
## High -5.089257e+13 9.931906e+13 -1.666297e+14
## Very High 2.160573e+14 6.316692e+13 -1.612869e+14
## State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium 2.909520e+13 -1.821427e+14
## High 2.683677e+13 -6.917006e+13
## Very High 2.459741e+14 -7.642708e+13
## State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium 3.851670e+13 5.831002e+13
## High 8.722554e+13 3.828842e+13
## Very High 3.262415e+14 2.060451e+13
## State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium 1.600025e+14 1.496193e+14 1.217126e+14
## High 7.930991e+13 6.789402e+13 1.364525e+15
## Very High 1.848038e+14 3.742822e+13 2.793172e+14
## State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium 4.691169e+13 -7.404304e+13
## High 9.761416e+13 -4.200632e+15
## Very High 4.388300e+13 -1.245481e+14
## State_Full2MT-Montana State_Full2NC-North Carolina
## Medium -4.668484e+13 4.401179e+13
## High -1.023882e+13 5.177846e+13
## Very High -1.377210e+13 6.110861e+12
## State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium -2.158286e+13 -3.608267e+14
## High 7.588249e+12 -1.188545e+14
## Very High -1.519456e+14 -2.797867e+14
## State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium 1.675063e+13 3.349195e+12
## High 1.213769e+15 6.039013e+12
## Very High 1.840662e+13 7.487980e+12
## State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium -9.905406e+13 -2.003749e+14 -8.065052e+10
## High -3.701046e+15 -2.868650e+13 -1.038080e+13
## Very High -1.803705e+14 -1.672538e+15 -2.875218e+13
## State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium 5.368327e+13 -5.165753e+13 1.099791e+14
## High 1.067396e+14 -3.793888e+13 3.169869e+15
## Very High 1.339543e+15 -5.272359e+13 1.521862e+14
## State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium 2.709366e+13 4.063994e+13
## High 1.374803e+14 2.491315e+14
## Very High 9.689842e+13 9.862965e+13
## State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium 4.141352e+13 -4.204780e+14
## High -1.273281e+13 -3.628230e+14
## Very High -3.199583e+13 -2.234106e+14
## State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium -1.792600e+13 -2.836206e+14 -1.742348e+13
## High -4.229080e+13 -2.870802e+14 -1.747397e+13
## Very High -5.156007e+13 -1.445080e+13 -1.642721e+13
## State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium 3.100723e+13 1.907691e+13 -9.119890e+12
## High 4.713459e+13 1.538141e+14 -6.026990e+12
## Very High 1.106358e+13 3.891825e+13 -1.862163e+12
## State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium 8.193619e+13 1.699371e+14
## High 1.130242e+14 5.492476e+13
## Very High 2.980451e+13 1.612144e+14
## State_Full2WY-Wyoming I(Year^2) I(log(Denominator))
## Medium -2.850957e+14 12009.53 -978148399524
## High -3.573347e+14 15617.24 -187366682293
## Very High -2.574934e+14 12149.77 -287140725316
# p value of 2-tailed z test
p <- (1 - pnorm(abs(z), 0, 1)) * 2
cat('\n< Table of coefficient p values>')
##
## < Table of coefficient p values>
p
## (Intercept) Denominator Year Sourcemedi Sourcensch Sourcesped
## Medium 0 0.000000000 0 0 0 0
## High 0 0.000000000 0 0 0 0
## Very High 0 0.006342669 0 0 0 0
## State_Full2AL-Alabama State_Full2AR-Arkansas State_Full2AZ-Arizona
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2CA-California State_Full2CO-Colorado
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2MT-Montana State_Full2NC-North Carolina
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
## State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium 0 0
## High 0 0
## Very High 0 0
## State_Full2WY-Wyoming I(Year^2) I(log(Denominator))
## Medium 0 0 0
## High 0 0 0
## Very High 0 0 0
# Uncomment below to display all p values:
# paste(p)
While no exact equivalent to the \(R^2\) of linear regression exists, the McFadden \(R^2\) index can be used to assess the model fit.
# R^2 equivalent
pR2(fit_model_LR_Risk4)[4]
## fitting null model for pseudo-r2
## # weights: 8 (3 variable)
## initial value 2345.610059
## iter 10 value 1979.347988
## final value 1979.347206
## converged
## McFadden
## 0.6035315
<p>
McFadden \(R^2\) = 0.6035
</p>
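McFadden's index compares the log-likelihood of the fitted model against that of an intercept-only (null) model: \(R^2_{McF} = 1 - \ln L_{model} / \ln L_{null}\). A minimal hand computation on a toy logistic model (using the built-in mtcars data for illustration, not the workshop model) shows what pR2() reports:

```r
# McFadden pseudo-R^2 computed by hand on a toy logistic model.
fit  <- glm(am ~ wt + hp, family = binomial, data = mtcars)  # fitted model
null <- glm(am ~ 1,       family = binomial, data = mtcars)  # intercept-only model
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden  # between 0 and 1; larger means a better fit than the null model
```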
<a href="">
<img src="" width="750" align="center">
</a>
<h3>
Linear Model: Model Evaluation - Workshop Task
</h3>
Workshop Task:
Model Evaluation Workshop Task 1(a): Train/Test Dataset Split
if(!require(caTools)){install.packages("caTools")}
## Loading required package: caTools
library("caTools")
# Set a seed so the random sampling below can be reproduced to check results.
set.seed(88)
# Stratified Random Sampling: split the dataset into two sets in a predefined proportion (SplitRatio)
# while preserving the class ratio of the dependent variable (e.g. the proportion of Low/High).
split <- sample.split(ASD_State_4_LR_Risk2$Prevalence_Risk2, SplitRatio = 0.7)
# Get training and test data
trainset <- subset(ASD_State_4_LR_Risk2, split == TRUE)
testset <- subset(ASD_State_4_LR_Risk2, split == FALSE)
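The stratification that sample.split performs can be sketched in base R: sample within each class separately, so both subsets keep the original class ratio (toy labels below; variable names are illustrative, not the workshop data):

```r
# Stratified 70/30 split sketch in base R (what caTools::sample.split automates).
set.seed(88)
y <- factor(rep(c("Low", "High"), times = c(60, 40)))   # toy labels: 60% Low, 40% High
train_idx <- unlist(lapply(split(seq_along(y), y),       # group indices by class,
                           function(i) sample(i, round(0.7 * length(i)))))  # sample 70% per class
train_y <- y[train_idx]
test_y  <- y[-train_idx]
prop.table(table(train_y))  # class ratio matches the full data: 0.4 High / 0.6 Low
```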
Build a binary classification model to predict (categorical) Prevalence Risk Level using Logistic Regression (LR)
# Binary Classification:
fit_model_LR_Risk2 = glm(Prevalence_Risk2 ~ Denominator + Year + Source + State_Full2 + I(Year^2) + I(log(Denominator)),
family=binomial(link='logit'), data = trainset) # data = trainset
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit_model_LR_Risk2)
##
## Call:
## glm(formula = Prevalence_Risk2 ~ Denominator + Year + Source +
## State_Full2 + I(Year^2) + I(log(Denominator)), family = binomial(link = "logit"),
## data = trainset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2887 -0.2840 0.0000 0.2573 2.9945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.763e+04 3.538e+04 1.629 0.103311
## Denominator 4.162e-06 6.157e-07 6.760 1.38e-11 ***
## Year -5.815e+01 3.526e+01 -1.649 0.099171 .
## Sourcemedi 1.118e+00 1.445e+00 0.774 0.439156
## Sourcensch 1.054e+01 7.542e+02 0.014 0.988849
## Sourcesped 7.207e-01 1.804e+00 0.399 0.689548
## State_Full2AL-Alabama 1.431e+00 1.343e+00 1.066 0.286634
## State_Full2AR-Arkansas 3.125e+00 1.264e+00 2.472 0.013421 *
## State_Full2AZ-Arizona 3.938e+00 1.491e+00 2.641 0.008263 **
## State_Full2CA-California -1.017e+01 3.302e+00 -3.079 0.002076 **
## State_Full2CO-Colorado 3.551e-01 1.310e+00 0.271 0.786395
## State_Full2CT-Connecticut 2.313e+00 1.167e+00 1.982 0.047432 *
## State_Full2DC-District of Columbia -3.179e+00 1.050e+00 -3.029 0.002457 **
## State_Full2DE-Delaware -1.295e+00 1.093e+00 -1.185 0.236160
## State_Full2FL-Florida -1.461e+00 1.841e+00 -0.794 0.427311
## State_Full2GA-Georgia 2.816e+00 1.582e+00 1.780 0.075089 .
## State_Full2HI-Hawaii -5.153e-01 9.723e-01 -0.530 0.596133
## State_Full2IA-Iowa -1.417e+00 1.175e+00 -1.205 0.228072
## State_Full2ID-Idaho 3.280e+00 1.034e+00 3.171 0.001519 **
## State_Full2IL-Illinois 1.257e+00 1.800e+00 0.698 0.484964
## State_Full2IN-Indiana 5.164e+00 1.490e+00 3.465 0.000530 ***
## State_Full2KS-Kansas 1.823e+00 1.128e+00 1.616 0.106019
## State_Full2KY-Kentucky 2.163e+00 1.366e+00 1.583 0.113318
## State_Full2LA-Louisiana -8.178e-01 1.387e+00 -0.589 0.555579
## State_Full2MA-Massachusetts 2.647e+00 1.391e+00 1.904 0.056939 .
## State_Full2MD-Maryland 5.012e+00 1.410e+00 3.556 0.000377 ***
## State_Full2ME-Maine 4.508e+00 1.062e+00 4.246 2.18e-05 ***
## State_Full2MI-Michigan 7.193e+00 1.754e+00 4.100 4.14e-05 ***
## State_Full2MN-Minnesota 7.660e+00 1.628e+00 4.705 2.53e-06 ***
## State_Full2MO-Missouri 3.821e+00 1.449e+00 2.637 0.008354 **
## State_Full2MS-Mississippi -2.675e-01 1.284e+00 -0.208 0.834937
## State_Full2MT-Montana -5.011e-01 9.449e-01 -0.530 0.595874
## State_Full2NC-North Carolina 3.973e+00 1.597e+00 2.488 0.012848 *
## State_Full2ND-North Dakota -4.695e-01 9.938e-01 -0.472 0.636644
## State_Full2NE-Nebraska -6.207e-01 1.069e+00 -0.581 0.561543
## State_Full2NH-New Hampshire 2.147e+00 9.642e-01 2.227 0.025944 *
## State_Full2NJ-New Jersey 3.282e+00 1.422e+00 2.307 0.021054 *
## State_Full2NM-New Mexico -1.324e+00 1.217e+00 -1.088 0.276734
## State_Full2NV-Nevada -6.188e-01 1.042e+00 -0.594 0.552734
## State_Full2NY-New York -9.957e-02 1.915e+00 -0.052 0.958533
## State_Full2OH-Ohio 1.624e+00 1.714e+00 0.947 0.343477
## State_Full2OK-Oklahoma 6.620e-01 1.293e+00 0.512 0.608721
## State_Full2OR-Oregon 2.351e+01 1.511e+03 0.016 0.987583
## State_Full2PA-Pennsylvania 2.252e+00 1.628e+00 1.383 0.166624
## State_Full2RI-Rhode Island 3.716e+00 1.116e+00 3.331 0.000866 ***
## State_Full2SC-South Carolina 2.326e+00 1.282e+00 1.815 0.069583 .
## State_Full2SD-South Dakota -3.282e+00 9.832e-01 -3.338 0.000843 ***
## State_Full2TN-Tennessee 4.190e-01 1.487e+00 0.282 0.778165
## State_Full2TX-Texas -7.006e+00 2.619e+00 -2.675 0.007466 **
## State_Full2UT-Utah 9.209e-01 1.077e+00 0.855 0.392485
## State_Full2VA-Virginia 2.424e+00 1.475e+00 1.643 0.100321
## State_Full2VT-Vermont 2.200e+00 1.064e+00 2.067 0.038706 *
## State_Full2WA-Washington 2.264e+00 1.465e+00 1.545 0.122328
## State_Full2WI-Wisconsin 6.659e+00 1.414e+00 4.709 2.49e-06 ***
## State_Full2WV-West Virginia 2.207e+00 1.063e+00 2.075 0.037965 *
## State_Full2WY-Wyoming -1.696e+00 1.050e+00 -1.614 0.106509
## I(Year^2) 1.467e-02 8.788e-03 1.670 0.094995 .
## I(log(Denominator)) -2.728e+00 5.762e-01 -4.734 2.20e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1622.82 on 1183 degrees of freedom
## Residual deviance: 579.43 on 1126 degrees of freedom
## AIC: 695.43
##
## Number of Fisher Scoring iterations: 18
Model Evaluation Workshop Task 2(b): Confusion Matrix & Accuracy for Classification
https://en.wikipedia.org/wiki/Confusion_matrix
# Confusion matrix on Trainset
probTrainset <- predict(fit_model_LR_Risk2, type = 'response')
# One way to choose a threshold: use the proportion of High risk in the *Training* data.
threshold2 <- sum(trainset$Prevalence_Risk2 == "High")/length(trainset$Prevalence_Risk2)
cat('Trainset High Risk Threshold = ', threshold2)
## Trainset High Risk Threshold = 0.5625
# If logistic regression probability > threshold, predict High, else predict Low.
predictTrainset <- ifelse(probTrainset > threshold2, "High", "Low")
# Create a contingency table (Confusion Matrix) with actuals on rows and predictions on columns.
table(trainset$Prevalence_Risk2, predictTrainset)
## predictTrainset
## High Low
## Low 46 472
## High 593 73
# Accuracy on Trainset
AccuracyTrain <- mean(predictTrainset == trainset$Prevalence_Risk2)
cat('Trainset Accuracy = ', AccuracyTrain)
## Trainset Accuracy = 0.8994932
# Confusion matrix on Testset
probTestset <- predict(fit_model_LR_Risk2, newdata = testset, type = 'response') # newdata = testset
# Reuse the threshold derived from the *Training* data; do not recompute it from the test set.
cat('Reused Trainset High Risk Threshold = ', threshold2)
## Reused Trainset High Risk Threshold = 0.5625
# If logistic regression probability > threshold, predict High, else predict Low.
predictTestset <- ifelse(probTestset > threshold2, "High", "Low")
# Create a contingency table (Confusion Matrix) with actuals on rows and predictions on columns.
table(testset$Prevalence_Risk2, predictTestset)
## predictTestset
## High Low
## Low 30 192
## High 258 28
# Accuracy on Testset
AccuracyTest <- mean(predictTestset == testset$Prevalence_Risk2)
cat('Testset Accuracy = ', AccuracyTest)
## Testset Accuracy = 0.8858268
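The accuracy figures above are simply the fraction of diagonal entries in the confusion matrix (correct predictions over all predictions). A tiny self-contained example with toy labels (not the workshop data):

```r
# Accuracy from a confusion matrix: actuals on rows, predictions on
# columns, correct predictions on the diagonal (toy labels).
actual    <- factor(c("High", "High", "Low", "Low", "High", "Low"))
predicted <- factor(c("High", "Low",  "Low", "Low", "High", "High"))
cm <- table(actual, predicted)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 4 of 6 predictions are correct: ~0.667
```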
Model Evaluation Workshop Task 3(c): K-Fold Cross Validation
The caret package (short for Classification And REgression Training) is a set of functions that streamline the process of creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, and model tuning using resampling:
http://topepo.github.io/caret/index.html
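What trainControl(method = "cv", number = 5) automates can be written by hand: partition the rows into k folds, hold out one fold at a time, fit on the rest, and average the held-out error. A base-R sketch on a toy linear model (mtcars, not the workshop data):

```r
# Manual 5-fold cross-validation sketch (base R, toy model).
set.seed(88)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold label per row
cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])  # train on k-1 folds
  pred <- predict(fit, newdata = mtcars[folds == i, ])    # predict the held-out fold
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))           # held-out RMSE
})
mean(cv_rmse)  # cross-validated RMSE estimate
```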
if(!require(caret)){install.packages("caret")}
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
library("caret")
if(!require(e1071)){install.packages("e1071")}
## Loading required package: e1071
library("e1071")
# Caret train/test split method:
set.seed(88)
caret_idx = createDataPartition(ASD_State_4_LR_Risk2$Prevalence_Risk2, p = 0.85, list = FALSE)
caret_trainset = ASD_State_4_LR_Risk2[caret_idx, ]
caret_testset = ASD_State_4_LR_Risk2[-caret_idx, ]
#caret_threshold <- sum(caret_trainset$Prevalence_Risk2 == "Low")/length(caret_trainset$Prevalence_Risk2)
#caret_threshold
# Caret logistic regression
set.seed(88)
cv_control = trainControl(method = "cv", number = 5)
caret_model_LR_Risk2 = train(form = Prevalence_Risk2 ~ ., data = caret_trainset,
trControl = cv_control, method = "glm", family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#
caret_model_LR_Risk2
## Generalized Linear Model
##
## 1439 samples
## 4 predictors
## 2 classes: 'Low', 'High'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1151, 1151, 1151, 1151, 1152
## Resampling results:
##
## Accuracy Kappa
## 0.8638018 0.7237422
# Get predicted class label
caret_model_LR_Risk2_Pred <- predict(caret_model_LR_Risk2, caret_testset)
# Uncomment below to get class probability
# caret_model_LR_Risk2_Pred <- predict(caret_model_LR_Risk2, caret_testset, type = "prob")
https://topepo.github.io/caret/using-your-own-model-in-train.html#Illustration5
# CM & Acc
cm_table <- table(as.factor(caret_model_LR_Risk2_Pred), caret_testset$Prevalence_Risk2)
confusionMatrix(cm_table)
## Confusion Matrix and Statistics
##
##
## Low High
## Low 92 11
## High 19 131
##
## Accuracy : 0.8814
## 95% CI : (0.8351, 0.9185)
## No Information Rate : 0.5613
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7573
##
## Mcnemar's Test P-Value : 0.2012
##
## Sensitivity : 0.8288
## Specificity : 0.9225
## Pos Pred Value : 0.8932
## Neg Pred Value : 0.8733
## Prevalence : 0.4387
## Detection Rate : 0.3636
## Detection Prevalence : 0.4071
## Balanced Accuracy : 0.8757
##
## 'Positive' Class : Low
##
Model Evaluation Workshop Task 4(d): \(R^2\), MSE, RMSE for Regression
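These metrics have simple closed forms: MSE is the mean squared error, RMSE is its square root, and \(R^2 = 1 - SS_{res}/SS_{tot}\). Worked on toy numbers:

```r
# Regression metrics computed by hand on toy numbers.
y     <- c(3.0, 5.0, 7.0, 9.0)   # actual values
y_hat <- c(2.5, 5.5, 6.5, 9.5)   # predicted values
mse  <- mean((y - y_hat)^2)                            # 0.25
rmse <- sqrt(mse)                                      # 0.5
rsq  <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)  # 0.95
c(MSE = mse, RMSE = rmse, R2 = rsq)
```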
# Caret train/test split method:
set.seed(88)
caret_idx = createDataPartition(ASD_State_4_MLR$Prevalence, p = 0.85, list = FALSE)
caret_trainset = ASD_State_4_MLR[caret_idx, ]
caret_testset = ASD_State_4_MLR[-caret_idx, ]
# In the plots below, the train and test distributions have similar shapes, which indicates a good split.
par(mfrow=c(1, 2))
plot(density(caret_trainset$Prevalence))
plot(density(caret_testset$Prevalence))
par(mfrow=c(1, 1))
Build a regression model to predict (numeric) Prevalence using Multiple Linear Regression (MLR)
if(!require(elasticnet)){install.packages("elasticnet")}
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
library("elasticnet")
# Caret MLR regression
set.seed(88)
cv_control = trainControl(method = "cv", number = 5)
#
caret_model_MLR <- train(Prevalence ~ ., data = caret_trainset, trControl = cv_control, method = "lm")
#
caret_model_MLR
## Linear Regression
##
## 1440 samples
## 4 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1152, 1153, 1152, 1152, 1151
## Resampling results:
##
## RMSE Rsquared MAE
## 2.960949 0.7460988 1.990705
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=8)
ggplot(varImp(caret_model_MLR))
# plot(varImp(caret_model_MLR))
caret_model_MLR_Pred <- predict(caret_model_MLR, caret_testset)
head(caret_model_MLR_Pred)
## 22 27 33 36 41 55
## 9.389012 7.831519 9.670434 10.609628 12.070592 13.402663
# MSE
mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
## [1] 5.837448
# RMSE
sqrt(mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2))
## [1] 2.416081
Calculate \(R^2\) for entire train set (all folds) https://stackoverflow.com/questions/25691127/r-squared-on-test-data
caret_model_MLR_Pred_Train <- predict(caret_model_MLR, caret_trainset)
SS.total <- sum((caret_trainset$Prevalence-mean(caret_trainset$Prevalence))^2)
SS.residual <- sum(residuals(caret_model_MLR)^2)
SS.regression <- sum((caret_model_MLR_Pred_Train-mean(caret_trainset$Prevalence))^2)
# fraction of variability explained by the model
cat("\nTrain R^2 fraction of variability explained by the model :", SS.regression/SS.total )
##
## Train R^2 fraction of variability explained by the model : 0.7699826
# caret_model_MLR_L2
Calculate \(R^2\) for test set https://stackoverflow.com/questions/25691127/r-squared-on-test-data
# True y values:
# head(caret_testset$Prevalence)
# Predicted y values (y hat):
# head(caret_model_MLR_Pred)
SS.total <- sum((caret_testset$Prevalence - mean(caret_testset$Prevalence))^2)
SS.residual <- sum((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
SS.regression <- sum((caret_model_MLR_Pred - mean(caret_testset$Prevalence))^2)
# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total
# cat("\nNOT Test fraction of variability explained by the model :", test.rsq )
# fraction of variability explained by the model
cat("\nTest R^2 fraction of variability explained by the model :", SS.regression/SS.total )
##
## Test R^2 fraction of variability explained by the model : 0.8988214
<a href="">
<img src="" width="750" align="center">
</a>
<h3>
Linear Model: Prevent Overfitting by Regularization Methods - L1 Lasso Regression & L2 Ridge regression
</h3>
https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
L1 Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of each coefficient as a penalty term to the loss function.
L2 Ridge Regression adds the “squared magnitude” of each coefficient as a penalty term to the loss function.
The key difference between these techniques is that L1 Lasso can shrink the coefficients of less important features all the way to zero, removing those features altogether. This makes it useful for feature selection when we have a large number of features.
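In symbols, lasso minimizes \(RSS + \lambda \sum_j |\beta_j|\) while ridge minimizes \(RSS + \lambda \sum_j \beta_j^2\). Computing the two penalty terms on a toy coefficient vector makes the difference concrete:

```r
# L1 vs L2 penalty terms on a toy coefficient vector.
beta   <- c(2, -0.5, 0, 1.5)
lambda <- 0.1
l1_penalty <- lambda * sum(abs(beta))  # lasso: 0.1 * 4   = 0.4
l2_penalty <- lambda * sum(beta^2)     # ridge: 0.1 * 6.5 = 0.65
c(L1 = l1_penalty, L2 = l2_penalty)
```

Note that the zero coefficient contributes nothing to either penalty; lasso's absolute-value penalty is what pushes small coefficients to exactly zero during fitting.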
https://towardsdatascience.com/create-predictive-models-in-r-with-caret-12baf9941236
Enhanced MLR using Regularization: L1 Lasso
if(!require(elasticnet)){install.packages("elasticnet")}
library("elasticnet")
# Caret MLR with Regularization
# possible methods: "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV"
cv_control <- trainControl(method = "repeatedcv",
number = 10, # number of folds
repeats = 5) # repeated N times
caret_model_MLR_L1 <- train(Prevalence ~ .,
data = caret_trainset,
method = "lasso", # Try using "ridge"
trControl = cv_control,
preProcess = c('scale', 'center')) # Auto pre-process data
#
caret_model_MLR_L1
## The lasso
##
## 1440 samples
## 4 predictors
##
## Pre-processing: scaled (55), centered (55)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1296, 1295, 1296, 1296, 1296, 1297, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.1 4.594316 0.5964018 3.264456
## 0.5 3.158034 0.7178476 2.140953
## 0.9 2.944008 0.7526252 1.969675
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.9.
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=2)
ggplot(varImp(caret_model_MLR_L1))
# plot(varImp(caret_model_MLR_L1))
caret_model_MLR_Pred <- predict(caret_model_MLR_L1, caret_testset)
head(caret_model_MLR_Pred)
## 22 27 33 36 41 55
## 9.095595 7.657583 9.711652 10.420214 11.766911 13.097625
# MSE
mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
## [1] 5.900344
# RMSE
sqrt(mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2))
## [1] 2.429062
Calculate \(R^2\) for entire train set (all folds) https://stackoverflow.com/questions/25691127/r-squared-on-test-data
caret_model_MLR_Pred_Train <- predict(caret_model_MLR_L1, caret_trainset)
SS.total <- sum((caret_trainset$Prevalence-mean(caret_trainset$Prevalence))^2)
SS.residual <- sum(residuals(caret_model_MLR_L1)^2)
SS.regression <- sum((caret_model_MLR_Pred_Train-mean(caret_trainset$Prevalence))^2)
# fraction of variability explained by the model
cat("\nTrain R^2 fraction of variability explained by the model :", SS.regression/SS.total )
##
## Train R^2 fraction of variability explained by the model : 0.7498116
# caret_model_MLR_L2
Calculate \(R^2\) for test set https://stackoverflow.com/questions/25691127/r-squared-on-test-data
# True y values:
# head(caret_testset$Prevalence)
# Predicted y values (y hat):
# head(caret_model_MLR_Pred)
SS.total <- sum((caret_testset$Prevalence - mean(caret_testset$Prevalence))^2)
SS.residual <- sum((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
SS.regression <- sum((caret_model_MLR_Pred - mean(caret_testset$Prevalence))^2)
# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total
# cat("\nNOT Test fraction of variability explained by the model :", test.rsq )
# fraction of variability explained by the model
cat("\nTest R^2 fraction of variability explained by the model :", SS.regression/SS.total )
##
## Test R^2 fraction of variability explained by the model : 0.8798161
Enhanced MLR using Regularization: L2 Ridge
# Caret MLR with Regularization
# possible methods: "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV"
cv_control <- trainControl(method = "repeatedcv",
number = 10, # number of folds
repeats = 5) # repeated N times
caret_model_MLR_L2 <- train(Prevalence ~ .,
data = caret_trainset,
method = "ridge", # Try using "lasso"
trControl = cv_control,
preProcess = c('scale', 'center')) # Auto pre-process data
#
caret_model_MLR_L2
## Ridge Regression
##
## 1440 samples
## 4 predictors
##
## Pre-processing: scaled (55), centered (55)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1297, 1296, 1297, 1296, 1295, 1297, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0e+00 2.949273 0.7532727 1.971258
## 1e-04 2.949261 0.7532723 1.971315
## 1e-01 3.000592 0.7444626 2.054192
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=2)
ggplot(varImp(caret_model_MLR_L2))
# plot(varImp(caret_model_MLR_L2))
caret_model_MLR_Pred <- predict(caret_model_MLR_L2, caret_testset)
head(caret_model_MLR_Pred)
## 22 27 33 36 41 55
## 9.384760 7.826531 9.667148 10.606323 12.066339 13.398437
# MSE
mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
## [1] 5.838206
# RMSE
sqrt(mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2))
## [1] 2.416238
Calculate \(R^2\) for entire train set (all folds) https://stackoverflow.com/questions/25691127/r-squared-on-test-data
caret_model_MLR_Pred_Train <- predict(caret_model_MLR_L2, caret_trainset)
SS.total <- sum((caret_trainset$Prevalence-mean(caret_trainset$Prevalence))^2)
SS.residual <- sum(residuals(caret_model_MLR_L2)^2)
SS.regression <- sum((caret_model_MLR_Pred_Train-mean(caret_trainset$Prevalence))^2)
# fraction of variability explained by the model
cat("\nTrain R^2 fraction of variability explained by the model :", SS.regression/SS.total )
##
## Train R^2 fraction of variability explained by the model : 0.7699436
# caret_model_MLR_L2
Calculate \(R^2\) for test set https://stackoverflow.com/questions/25691127/r-squared-on-test-data
# True y values:
# head(caret_testset$Prevalence)
# Predicted y values (y hat):
# head(caret_model_MLR_Pred)
SS.total <- sum((caret_testset$Prevalence - mean(caret_testset$Prevalence))^2)
SS.residual <- sum((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
SS.regression <- sum((caret_model_MLR_Pred - mean(caret_testset$Prevalence))^2)
# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total
# cat("\nNOT Test fraction of variability explained by the model :", test.rsq )
# fraction of variability explained by the model
cat("\nTest R^2 fraction of variability explained by the model :", SS.regression/SS.total )
##
## Test R^2 fraction of variability explained by the model : 0.8987872
<a href="">
<img src="" width="750" align="center">
</a>
<h3>
What to submit?
</h3>
<p>
Create a predictive model for multi-class classification of ASD Prevalence Risk Level (Low, Medium, High, Very High) using the caret package's multinom logistic regression algorithm.
</p>
References:
https://daviddalpiaz.github.io/r4sl/the-caret-package.html
http://topepo.github.io/caret/index.html
https://cran.r-project.org/web/packages/caret/caret.pdf
# Code example of multinomial logistic regression using the iris flower dataset
iris[c(1:3, 51:53, 101:103), ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
iris_model = train(Species ~ ., data = iris, method = "multinom",
trControl = trainControl(method = "cv", number = 5), trace = FALSE)
iris_model
## Penalized Multinomial Regression
##
## 150 samples
## 4 predictors
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 120, 120, 120, 120, 120
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.9533333 0.93
## 1e-04 0.9600000 0.94
## 1e-01 0.9733333 0.96
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
<a href="">
<img src="" width="750" align="center">
</a>
Connect with the author:
This notebook was written by GU Zhan (Sam).
Sam is currently a lecturer at the Institute of Systems Science, National University of Singapore. He devotes himself to pedagogy and andragogy, and is passionate about inspiring the next generation of artificial intelligence lovers and leaders.
Copyright © 2020 GU Zhan
This notebook and its source code are released under the terms of the MIT License.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
<a href="">
<img src="" width="750" align="center">
</a>
<h3>
Interactive workshops: < Learning R inside R > using swirl() (in R/RStudio)
</h3>
https://github.com/telescopeuser/S-SB-Workshop
<a href="https://github.com/dd-consulting">
<img src="../reference/GZ_logo.png" width="60" align="right">
https://github.com/dd-consulting
</a>